Introduction

Logistic regression is a fundamental statistical technique for data science and analytics. It is part of a class of modeling techniques known as classification models, since they try to predict categorical target variables. Classification is one of, if not the, most common type of business problem that needs solving. Nearly every field needs to solve these kinds of problems; examples include target marketing, churn prediction, probability of default, and fraud detection, to name just a few.

Logistic regression is a form of supervised classification since the target variable is known. This target variable can be binary, ordinal, or even nominal in structure. This markdown file contains information on how to perform logistic regression and all of its components in SAS and R. In each section you can toggle back and forth between the code and output you want to view.

The first thing we need to do is load all of the R libraries we will be using. This isn’t needed for the SAS sections of the code.
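The library chunk itself was not preserved in this file. Based on the functions used later in the document (boxTidwell, gam, visreg), a plausible setup looks like this; the exact package list is an assumption:

```r
# Assumed package list, inferred from the functions used later in this document
library(car)     # boxTidwell() for testing the linearity assumption
library(mgcv)    # gam() for generalized additive models
library(visreg)  # visreg() for effect plots
```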

Binary Logistic Regression

The primary focus will be binary logistic regression. It is the most common type of logistic regression, and sets up the foundation for both ordinal and nominal logistic regression.

We will be using the birthweight data set to model the association between various factors and a child being born with low birth weight (<2.5kg). The variables in the data set are the following:

Birth Weight Data Dictionary

| Variable | Description |
|----------|-------------|
| age | mother’s age (years) |
| lwt | mother’s weight at last menstrual period (lbs) |
| smoke | mother’s smoking status during pregnancy |
| race | mother’s race (1 = White, 2 = Black, 3 = Other) |
| ptl | number of premature labors |
| ht | history of hypertension |
| ui | uterine irritability |
| ftv | number of physician visits during first trimester |

The linear probability model is not widely used, since probabilities do not tend to be linearly related to their predictors. Also, the linear probability model can produce predictions outside the bounds of 0 and 1 (where probabilities should be!). For completeness’ sake, however, here is the linear probability model:

Linear Probability Model

Let’s first view what a linear probability model would look like plotted on our data and then we can build the model.

SAS

proc sgplot data=logistic.lowbwt;
   reg x = age y = low;
   title 'OLS for Binary Response?';
   xaxis label='Mother''s Age';
   yaxis label='Low Birth Weight';
run;

The SGPlot Procedure


Even though it doesn’t really appear to fit our data, let’s fit this linear probability model anyway for completeness’ sake.

proc reg data=logistic.lowbwt plot(unpack)=diagnostics;
   model low = age ht lwt ptl smoke ui;
run;
quit;
Model: MODEL1
Dependent Variable: low

Number of Observations Read 189
Number of Observations Used 189

Analysis of Variance

| Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
|--------|----|----------------|-------------|---------|--------|
| Model | 6 | 5.41117 | 0.90186 | 4.67 | 0.0002 |
| Error | 182 | 35.17085 | 0.19325 | | |
| Corrected Total | 188 | 40.58201 | | | |

| Root MSE | 0.43960 | R-Square | 0.1333 |
|----------|---------|----------|--------|
| Dependent Mean | 0.31217 | Adj R-Sq | 0.1048 |
| Coeff Var | 140.82038 | | |

Parameter Estimates

| Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
|----------|----|--------------------|----------------|---------|------------|
| Intercept | 1 | 0.67981 | 0.19113 | 3.56 | 0.0005 |
| age | 1 | -0.00719 | 0.00622 | -1.16 | 0.2492 |
| ht | 1 | 0.38540 | 0.13578 | 2.84 | 0.0050 |
| lwt | 1 | -0.00244 | 0.00112 | -2.18 | 0.0305 |
| ptl | 1 | 0.12527 | 0.06875 | 1.82 | 0.0701 |
| smoke | 1 | 0.10953 | 0.06683 | 1.64 | 0.1030 |
| ui | 1 | 0.16040 | 0.09377 | 1.71 | 0.0889 |



[Diagnostic plots for the model: histogram of residuals with normal and kernel densities overlaid; residuals and studentized residuals (RSTUDENT) plotted against predicted values; observed versus predicted values; a needle plot of Cook’s D by observation; studentized residuals versus leverage; a Q-Q plot of residuals; a residual-fit spread plot; and residuals plotted against each predictor (age, ht, lwt, ptl, smoke, ui).]

As we can see from the charts above, the assumptions of ordinary least squares don’t really hold in this situation. Therefore, we should be careful interpreting the results of the model. Maybe a better model won’t have these problems?

Note that PROC REG in SAS does not have a CLASS statement for easily handling categorical predictors, so the race variable has been left out of this model.

R
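The R chunk that produced the warning below was not preserved; here is a sketch of what it likely looked like. The data frame name bwt is taken from the later output, and the data is rebuilt here from MASS::birthwt (an assumption; that data set matches the data dictionary above):

```r
# Assumed data source: MASS::birthwt matches the data dictionary above
bwt <- within(MASS::birthwt, race <- c("white", "black", "other")[race])

# Fit the linear probability model with OLS
lp.model <- lm(low ~ age + ht + lwt + ptl + factor(smoke) + ui, data = bwt)

# Plot low against age; abline() uses only the intercept and the age slope,
# which is what triggers the warning about 7 regression coefficients
plot(low ~ age, data = bwt, xlab = "Mother's Age", ylab = "Low Birth Weight",
     main = "OLS for Binary Response?")
abline(lp.model)
```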

Warning in abline(lp.model): only using the first two of 7 regression
coefficients

Even though it doesn’t really appear to fit our data, let’s fit this linear probability model anyway for completeness’ sake.
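The fitting chunk was not preserved; based on the Call shown in the output below, it would have been something like this (the data is rebuilt here from MASS::birthwt, an assumption):

```r
# Data as in the document (rebuilt here from MASS::birthwt; an assumption)
bwt <- within(MASS::birthwt, race <- c("white", "black", "other")[race])

# Linear probability model, matching the Call shown in the output below
lp.model <- lm(low ~ age + ht + lwt + ptl + factor(smoke) + ui, data = bwt)
summary(lp.model)
```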


Call:
lm(formula = low ~ age + ht + lwt + ptl + factor(smoke) + ui, 
    data = bwt)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9144 -0.2947 -0.1878  0.4289  0.8797 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.679809   0.191132   3.557 0.000479 ***
age            -0.007193   0.006222  -1.156 0.249188    
ht              0.385398   0.135780   2.838 0.005049 ** 
lwt            -0.002435   0.001117  -2.180 0.030512 *  
ptl             0.125268   0.068754   1.822 0.070102 .  
factor(smoke)1  0.109533   0.066833   1.639 0.102961    
ui              0.160401   0.093774   1.711 0.088876 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4396 on 182 degrees of freedom
Multiple R-squared:  0.1333,    Adjusted R-squared:  0.1048 
F-statistic: 4.667 on 6 and 182 DF,  p-value: 0.0001935

As we can see from the charts above, the assumptions of ordinary least squares don’t really hold in this situation. Therefore, we should be careful interpreting the results of the model. Maybe a better model won’t have these problems?

Note the factor function on the variable smoke. This creates dummy variable representations in the modeling functions of R to help with the use of categorical variables.

Binary Logistic Regression Model

Due to the limitations of the linear probability model, people typically just use the binary logistic regression model. Let’s see how to run binary logistic regression in each of our softwares!

SAS

PROC LOGISTIC has a built-in CLASS statement. This allows us to input categorical variables easily into our logistic regression. The race variable has three categories - White, Black, and Other. We will use White as our reference category, as you can see with the ref = option listed after the race variable. By default, SAS uses effects coding for all categorical variables. To switch to reference coding we use the param = ref option at the end of the CLASS statement.

For our odds ratios as well as parameter estimates, the profile likelihood estimation method is preferred to the Wald approximation. This is accomplished by using the clodds = pl and clparm = pl options.

proc logistic data=logistic.lowbwt plots(only)=(oddsratio);
   class race(ref='white') / param=ref; 
   model low(event='1') = age race lwt smoke / clodds=pl clparm=pl;
   title 'Modeling Low Birth Weight';
run;
quit;
Modeling Low Birth Weight

Model Information
Data Set LOGISTIC.LOWBWT
Response Variable low
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring

Number of Observations Read 189
Number of Observations Used 189

Response Profile

| Ordered Value | low | Total Frequency |
|---------------|-----|-----------------|
| 1 | 0 | 130 |
| 2 | 1 | 59 |

Probability modeled is low='1'.


Class Level Information

| Class | Value | Design Variables | |
|-------|-------|------------------|---|
| race | black | 1 | 0 |
| | other | 0 | 1 |
| | white | 0 | 0 |

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

| Criterion | Intercept Only | Intercept and Covariates |
|-----------|----------------|--------------------------|
| AIC | 236.672 | 226.577 |
| SC | 239.914 | 246.028 |
| -2 Log L | 234.672 | 214.577 |

Testing Global Null Hypothesis: BETA=0

| Test | Chi-Square | DF | Pr > ChiSq |
|------|------------|----|------------|
| Likelihood Ratio | 20.0948 | 5 | 0.0012 |
| Score | 18.6377 | 5 | 0.0022 |
| Wald | 16.4973 | 5 | 0.0056 |

Type 3 Analysis of Effects

| Effect | DF | Wald Chi-Square | Pr > ChiSq |
|--------|----|-----------------|------------|
| age | 1 | 0.4326 | 0.5107 |
| race | 2 | 7.8419 | 0.0198 |
| lwt | 1 | 3.8470 | 0.0498 |
| smoke | 1 | 7.6991 | 0.0055 |

Analysis of Maximum Likelihood Estimates

| Parameter | | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|-----------|---|----|----------|----------------|-----------------|------------|
| Intercept | | 1 | 0.3324 | 1.1077 | 0.0900 | 0.7641 |
| age | | 1 | -0.0225 | 0.0342 | 0.4326 | 0.5107 |
| race | black | 1 | 1.2316 | 0.5171 | 5.6718 | 0.0172 |
| race | other | 1 | 0.9432 | 0.4162 | 5.1351 | 0.0234 |
| lwt | | 1 | -0.0125 | 0.00639 | 3.8470 | 0.0498 |
| smoke | | 1 | 1.0544 | 0.3800 | 7.6991 | 0.0055 |

Association of Predicted Probabilities and Observed Responses

| Statistic | Value | Statistic | Value |
|-----------|-------|-----------|-------|
| Percent Concordant | 68.4 | Somers' D | 0.367 |
| Percent Discordant | 31.6 | Gamma | 0.368 |
| Percent Tied | 0.0 | Tau-a | 0.159 |
| Pairs | 7670 | c | 0.684 |

Parameter Estimates and Profile-Likelihood Confidence Intervals

| Parameter | | Estimate | 95% Confidence Limits | |
|-----------|---|----------|-----------------------|---|
| Intercept | | 0.3324 | -1.8092 | 2.5609 |
| age | | -0.0225 | -0.0909 | 0.0436 |
| race | black | 1.2316 | 0.2206 | 2.2648 |
| race | other | 0.9432 | 0.1401 | 1.7809 |
| lwt | | -0.0125 | -0.0259 | -0.00064 |
| smoke | | 1.0544 | 0.3238 | 1.8222 |

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals

| Effect | Unit | Estimate | 95% Confidence Limits | |
|--------|------|----------|-----------------------|---|
| age | 1.0000 | 0.978 | 0.913 | 1.045 |
| race black vs white | 1.0000 | 3.427 | 1.247 | 9.629 |
| race other vs white | 1.0000 | 2.568 | 1.150 | 5.935 |
| lwt | 1.0000 | 0.988 | 0.974 | 0.999 |
| smoke | 1.0000 | 2.870 | 1.382 | 6.186 |

Plot of Odds Ratios with 95% Profile-Likelihood Confidence Limits


Let’s examine the output above. Scanning down the output, you can see that SAS provided us with a summary of the number of observations read and used in the analysis as well as a frequency breakdown of the target variable.

The important piece of the output here is the model convergence status. Note that the model converged. This estimation is done through maximum likelihood estimation (MLE) so we need this convergence. We will address what happens when this doesn’t converge in the Data Considerations section.

SAS then provides some basic model metrics like the AIC and SC as well as the global Likelihood Ratio Test. The small p-value here (we used a 0.02 significance level for this analysis based on the sample size) implies that at least one of the variables is useful in predicting low birth weight. The Type 3 Analysis table provides a variable-by-variable breakdown to see which of the variables prove helpful. At our significance level it appears that the variables age and lwt are not significant. We will address variable selection in the Subset Selection and Diagnostics section.

The MLE table provides us with the actual logistic regression equation itself for each of the variables, including the design variables created for race with the CLASS statement.

The Association of Predicted Probabilities table provides us with many model metrics that we will discuss in the Model Assessment section.

Lastly we are provided with the odds ratios for each of the above mentioned variables. Odds ratios are a great way of interpreting the output from a logistic regression. For example, let’s look at the smoke variable odds ratio. This tells us that mothers who smoke are 2.87 times as likely (in terms of odds) to have low birth weight babies, on average. To practice interpreting continuous variables, let’s look at the age variable even though it isn’t significant. Every year older that a mother is makes her 0.978 times as likely to have a low birth weight baby. This may be a little harder to understand. Another way to put it is that every year older a mother is reduces her odds of having a low birth weight baby by 2.2% (100*(1 - 0.978)) on average.

R

The glm function in R provides the ability to model binary logistic regressions. Similar to the lm function, you specify a model formula. Notice the two factor functions on both race and smoke. These convert the categorical variables into usable design variables for the modeling function.

The family = binomial(link = "logit") option specifies that we are building a logistic model. Generalized linear models (GLMs) are a general class of models; logistic regression is the special case where the link function is the logit function. Use the summary function to look at the necessary output.
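The chunk itself was not preserved; based on the Call shown in the output below, it would have been something like this (the data is rebuilt here from MASS::birthwt, an assumption; the model object name logit.model is the one used later in the document):

```r
# Data as in the document (rebuilt here from MASS::birthwt; an assumption)
bwt <- within(MASS::birthwt, race <- c("white", "black", "other")[race])

# Binary logistic regression; the name logit.model is reused later on
logit.model <- glm(low ~ age + lwt + factor(smoke) + factor(race),
                   family = binomial(link = "logit"), data = bwt)
summary(logit.model)
```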


Call:
glm(formula = low ~ age + lwt + factor(smoke) + factor(race), 
    family = binomial(link = "logit"), data = bwt)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5173  -0.9065  -0.5865   1.3035   2.0401  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)   
(Intercept)        1.564123   1.167005   1.340  0.18015   
age               -0.022478   0.034170  -0.658  0.51065   
lwt               -0.012526   0.006386  -1.961  0.04982 * 
factor(smoke)1     1.054439   0.380000   2.775  0.00552 **
factor(race)other -0.288409   0.526756  -0.548  0.58402   
factor(race)white -1.231671   0.517152  -2.382  0.01724 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 214.58  on 183  degrees of freedom
AIC: 226.58

Number of Fisher Scoring iterations: 4

Let’s examine the output above. Scanning down the output, you can see the actual logistic regression equation itself for each of the variables, including the design variables created for race and smoke with the factor functions. Unfortunately, by default R doesn’t report the Type 3 Analysis results that display the overall importance of these categorical variables. We will produce that with the next piece of code below. At our significance level (we used a 0.02 significance level for this analysis based on the sample size) it appears that the variables age and lwt are not significant. We will address variable selection in the Subset Selection and Diagnostics section.

A lot of the output produced in SAS by default must be requested in R. For example, if you want more metrics on how well the model performed, see the Model Assessment section of the document.

To perform a global Likelihood Ratio Test to see if any of the variables are significant overall you can run the following code.
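The chunk was not preserved; a sketch consistent with the output below, with the model and data reconstructed as before (the null-model name is made up here):

```r
# Assumed setup matching the earlier output
bwt <- within(MASS::birthwt, race <- c("white", "black", "other")[race])
logit.model <- glm(low ~ age + lwt + factor(smoke) + factor(race),
                   family = binomial(link = "logit"), data = bwt)

# Intercept-only model, then a likelihood ratio test against the full model
null.model <- glm(low ~ 1, family = binomial(link = "logit"), data = bwt)
anova(logit.model, null.model, test = "LRT")
```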

Analysis of Deviance Table

Model 1: low ~ age + lwt + factor(smoke) + factor(race)
Model 2: low ~ 1
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
1       183     214.58                        
2       188     234.67 -5  -20.095   0.0012 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above code creates a second model without any variables. It then performs a Likelihood Ratio Test (LRT) on this model compared to the original model. The results of this test show the usefulness of all the variables. The two models are significantly different (based on our significance level) which implies that at least one of the variables is significant. If these models weren’t significantly different then the variables would be adding no value statistically.

Categorical variables should not be evaluated solely based on their individual categorical design variable significance as those represent specific tests between combinations of categories. The variable overall must be evaluated for its usefulness to the model.
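The chunk was not preserved; a sketch consistent with the output below, with the model and data reconstructed as before (the reduced-model name is made up here):

```r
# Assumed setup matching the earlier output
bwt <- within(MASS::birthwt, race <- c("white", "black", "other")[race])
logit.model <- glm(low ~ age + lwt + factor(smoke) + factor(race),
                   family = binomial(link = "logit"), data = bwt)

# Same model without race, then a likelihood ratio test for race as a whole
norace.model <- glm(low ~ age + lwt + factor(smoke),
                    family = binomial(link = "logit"), data = bwt)
anova(logit.model, norace.model, test = "LRT")
```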

Analysis of Deviance Table

Model 1: low ~ age + lwt + factor(smoke) + factor(race)
Model 2: low ~ age + lwt + factor(smoke)
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       183     214.58                       
2       185     222.88 -2  -8.3021  0.01575 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above code creates a second model without the race variable. It then performs a LRT on this model compared to the original model. The results of this test show the usefulness of the race variable. The two models are significantly different (based on our significance level) which implies that the race variable (the only difference between the two models) is significant as a whole. If these models weren’t significantly different then the race variable would be adding no value statistically.

Lastly we can get the odds ratios for each of the above mentioned variables with the following code:
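The chunk was not preserved; a sketch consistent with the output below (model and data reconstructed as before). Exponentiating the coefficients and their profile-likelihood confidence limits gives the odds ratios:

```r
# Assumed setup matching the earlier output
bwt <- within(MASS::birthwt, race <- c("white", "black", "other")[race])
logit.model <- glm(low ~ age + lwt + factor(smoke) + factor(race),
                   family = binomial(link = "logit"), data = bwt)

# confint() on a glm profiles the likelihood, matching the document's
# preference for profile-likelihood intervals over Wald intervals
exp(cbind(coef(logit.model), confint(logit.model)))
```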

                                2.5 %     97.5 %
(Intercept)       4.7784821 0.5088761 50.9670681
age               0.9777725 0.9131073  1.0445960
lwt               0.9875525 0.9744679  0.9993613
factor(smoke)1    2.8703634 1.3823204  6.1857015
factor(race)other 0.7494552 0.2652201  2.1245479
factor(race)white 0.2918045 0.1038416  0.8020311

Odds ratios are a great way of interpreting the output from a logistic regression. For example, let’s look at the smoke variable odds ratio. This tells us that mothers who smoke are 2.87 times as likely (in terms of odds) to have low birth weight babies, on average. To practice interpreting continuous variables, let’s look at the age variable even though it isn’t significant. Every year older that a mother is makes her 0.978 times as likely to have a low birth weight baby. This may be a little harder to understand. Another way to put it is that every year older a mother is reduces her odds of having a low birth weight baby by 2.2% (100*(1 - 0.978)) on average.

Testing Assumptions

Outside of independence of observations, the biggest assumption of logistic regression is that the continuous predictor variables are linearly related to the logit function (our transformation of the probability of our target). There are two common ways to test this:

  1. Box-Tidwell Transformation Approach
  2. Generalized Additive Models (GAMs)

Let’s see how to produce these in each of our softwares!

SAS

First let’s look at the Box-Tidwell approach. To perform the Box-Tidwell approach in SAS you first need to create new transformations on the continuous variables. Each of the continuous variables needs to be transformed as variable*log(variable) where the log() is the natural log. These transformed variables need to be added back into the model along with the original continuous variables. The key is the significance of these variables. The ods html select ModelANOVA statement below only selects the Type 3 Analysis table from the PROC LOGISTIC since that is the only thing we are interested in.

data lowbwt;
   set logistic.lowbwt;
   aloga = age*log(age);
   llogl = lwt*log(lwt);
run;

ods html select ModelANOVA;
proc logistic data=lowbwt plots(only)=(oddsratio);
   class race(ref='white') / param=ref; 
   model low(event='1') = age race lwt smoke aloga llogl / clodds=pl clparm=pl;
   title 'Modeling Low Birth Weight';
run;
quit;
Modeling Low Birth Weight

Type 3 Analysis of Effects

| Effect | DF | Wald Chi-Square | Pr > ChiSq |
|--------|----|-----------------|------------|
| age | 1 | 1.0684 | 0.3013 |
| race | 2 | 7.7573 | 0.0207 |
| lwt | 1 | 0.0358 | 0.8499 |
| smoke | 1 | 7.3098 | 0.0069 |
| aloga | 1 | 1.1036 | 0.2935 |
| llogl | 1 | 0.0176 | 0.8945 |

Since the two variables aloga and llogl (the transformed versions of our continuous variables) are insignificant, the assumption of linearity holds. Neither of these two variables is better represented in this model by raising it to an exponent other than one. If either one were significant, then raising it to an exponent other than 1 would be a better fit for the model (the linearity assumption would fail).

Now let’s look at the GAMs. For this we can use PROC GAM. Similar to PROC LOGISTIC, we fit our MODEL statement with our full model. PROC GAM even has a CLASS statement for categorical variables. The only difference is that we will test the continuous variables with splines to see if those splines produce anything significantly better than their linear representations. The spline option is put on every one of the continuous variables. We will use df = 4 for each of these, allowing the spline up to three bends. We want to leave the categorical variables alone, so use the param option on those. The dist = and link = options are used to specify the logistic regression.

The ods statement here selects only the significance of the splines (ANODEV option) and the component plots (SmoothingComponentPlot option) since that is all we need to test the assumption.

ods html select ANODEV SmoothingComponentPlot;
proc gam data=logistic.lowbwt plots = components(clm commonaxes);
   class race(ref='white');
   model low(event='1') = spline(age, df=4) spline(lwt, df=4) param(smoke race) / dist = binomial link = logit;
run;
quit;
Dependent Variable: low
Regression Model Component(s): smoke race
Smoothing Model Component(s): spline(age) spline(lwt)

Smoothing Model Analysis
Analysis of Deviance

| Source | DF | Sum of Squares | Chi-Square | Pr > ChiSq |
|--------|----|----------------|------------|------------|
| Spline(age) | 3.00000 | 6.162954 | 6.1630 | 0.1039 |
| Spline(lwt) | 3.00000 | 4.187660 | 4.1877 | 0.2419 |

Panel of smoothing component plots for low with 95% confidence bands.


Since both variables are insignificant in their splines, the linearity assumption holds for both of the variables. Neither of these two variables are better represented in this model by raising them to an exponent other than one. If either one of them was significant, then raising them to an exponent other than 1 would be a better fit for the model (the linearity assumption would fail). The plots show this same pattern since the confidence intervals around the splines still contain the horizontal (linear pattern) line. If there was a significant spline then these plots might be able to visually reveal the exponent that the variable should be raised to.

Continuous variables that are raised to exponents can pose problems of interpretation in any regression model. The same holds true for logistic regression. If any of your continuous variables fail the linearity assumption there are two common approaches.

  1. Raise the variable to an exponent other than 1.
  2. Bin the variable into categories.

The first option above makes the variable harder to interpret. However, the second option of binning the continuous variables that fail this assumption works well since the assumption only holds for continuous variables and categorical variables are still relatively easy to interpret.

R

First let’s look at the Box-Tidwell approach. The Box-Tidwell approach in R is much easier to implement than in SAS. Here, the boxTidwell function will take care of everything for you.
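The chunk was not preserved; one plausible call puts the continuous predictors in the main formula and holds the categorical terms fixed through other.x (the exact arguments are an assumption):

```r
library(car)  # boxTidwell()

# Assumed setup matching the earlier output
bwt <- within(MASS::birthwt, race <- c("white", "black", "other")[race])

# Box-Tidwell test on the continuous predictors; the categorical terms are
# passed through other.x so they stay untransformed
boxTidwell(low ~ age + lwt, other.x = ~ factor(smoke) + factor(race),
           data = bwt)
```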

    MLE of lambda Score Statistic (z) Pr(>|z|)
age        3.9362             -0.7730   0.4395
lwt       -4.3556              1.0178   0.3088

iterations =  10 

Since the two variables age and lwt are insignificant, the assumption of linearity holds. Neither of these two variables is better represented in this model by raising it to an exponent other than one. If either one were significant, then raising it to an exponent other than 1 would be a better fit for the model (the linearity assumption would fail).

Now let’s look at the GAMs. For this we can use the gam function in R. Similar to the glm function, you have a model formula to specify the full model. Categorical variables can still be handled with the factor function. The only difference is that we will test the continuous variables with splines to see if those splines produce anything significantly better than their linear representations. The s function is put on every one of the continuous variables. The family = and link = options are used to specify the logistic regression. The method = option is specified here as there are multiple approaches to fitting the GAM in R. We will use the REML approach.
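The chunk was not preserved; based on the formula shown in the output below, it would have been something like this (data reconstructed as before; mgcv is assumed to be the gam implementation in use):

```r
library(mgcv)  # gam()

# Assumed setup matching the earlier output
bwt <- within(MASS::birthwt, race <- c("white", "black", "other")[race])

# GAM with splines on the continuous predictors, matching the formula in
# the output below; REML is used to fit the smoothing parameters
gam.model <- gam(low ~ s(age) + s(lwt) + smoke + factor(race),
                 family = binomial(link = "logit"),
                 method = "REML", data = bwt)
summary(gam.model)
```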


Family: binomial 
Link function: logit 

Formula:
low ~ s(age) + s(lwt) + smoke + factor(race)

Parametric coefficients:
                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)        -0.5737     0.4616  -1.243   0.2139   
smoke               1.0751     0.3821   2.813   0.0049 **
factor(race)other  -0.3417     0.5297  -0.645   0.5189   
factor(race)white  -1.2642     0.5203  -2.429   0.0151 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
        edf Ref.df Chi.sq p-value  
s(age) 2.07   2.64  1.351  0.5494  
s(lwt) 1.00   1.00  3.764  0.0524 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.0744   Deviance explained = 9.77%
-REML = 109.46  Scale est. = 1         n = 189

Since both variables are insignificant in their splines (the approximate significance of smooth terms part of the output above), the linearity assumption holds for both of the variables. Neither of these two variables is better represented in this model by raising it to an exponent other than one. If either one were significant, then raising it to an exponent other than 1 would be a better fit for the model (the linearity assumption would fail). The plots show this same pattern since the confidence intervals around the splines still contain a linear line. If there were a significant spline, the plots might visually reveal the exponent that the variable should be raised to. In fact, R even estimates the complexity of each spline for us through the edf (effective degrees of freedom) piece of the approximate significance of smooth terms output above. The mild curvature estimated for age isn’t significant, and lwt appears to have an exponent of 1 as its best fit.

Predicted Values

Obviously, the predicted logit values really don’t help us too much. Instead we want to gather the predicted probabilities of the target variable categories. Luckily, the softwares we are looking at make this rather easy to do.

Let’s see how to produce these predicted probabilities in each of our softwares!

SAS

If you split your data into training and validation (and/or testing) you could score that data set using PROC LOGISTIC’s built-in SCORE statement. Here we are creating a new data set of 5 observations to score instead. The SCORE statement has two primary pieces - data = and out =. The data = piece is where you specify the data set you want scored, while the out = piece is the name of the new scored data set.

The ods statement below only selects the effect plot, which will show us the logistic curve across different values of the first continuous variable, here age. It will show each of the categorical variable categories as a separate curve. The other continuous variables are just held at their average values.

data newbw;
   length race $5;
   input low age race $ lwt smoke;
   datalines;
   1  21 white 110 0
   0  40 black 120 0
   1  31 other 130 1
   0  28 white 140 1
   1  35 black 100 0
   ;
run;

ods html select EFFECTPLOT;
proc logistic data=logistic.lowbwt plots(only)=(oddsratio effect);
   class race(ref='white') smoke(ref='0') / param=ref; 
   model low(event='1') = age race lwt smoke / clodds=pl clparm=pl;
   title 'Modeling Low Birth Weight';
   score data=newbw out=bw_scored;
run;
quit;
   
proc print data=bw_scored;
run;
Modeling Low Birth Weight

Plot of Predicted Probabilities for low=1 by age, sliced by race*smoke, at lwt=129.8.




Modeling Low Birth Weight

| Obs | race | low | age | lwt | smoke | F_low | I_low | P_0 | P_1 |
|-----|------|-----|-----|-----|-------|-------|-------|-----|-----|
| 1 | white | 1 | 21 | 110 | 0 | 1 | 0 | 0.82015 | 0.17985 |
| 2 | black | 0 | 40 | 120 | 0 | 0 | 0 | 0.69805 | 0.30195 |
| 3 | other | 1 | 31 | 130 | 1 | 1 | 1 | 0.49876 | 0.50124 |
| 4 | white | 0 | 28 | 140 | 1 | 0 | 0 | 0.73028 | 0.26972 |
| 5 | black | 1 | 35 | 100 | 0 | 1 | 0 | 0.61660 | 0.38340 |

The PROC PRINT above looks at the scored data set. Notice that there are four additional columns beyond the original data set we were scoring. The first two, F_low and I_low, represent the original target variable and the predicted target variable respectively. Specifically, they stand for FROM and INTO. SAS decides the predicted category rather crudely: if the predicted probability of a category is above 0.5, it predicts that category will occur. We will see in the Model Assessment section that 0.5 is rarely the optimal cut-off.

The final two columns are the predicted probabilities for each of the values of the target variable - P_0 and P_1. These are the predicted probabilities of 0 and 1 in our target variable respectively.

R

If you split your data into training and validation (and/or testing) you could score that data set using the predict function. Here we are creating a new data set of 5 observations to score instead. The predict function has three primary pieces - the model object (here called logit.model), the newdata = option, and the type = option. The newdata = option is where you specify the data set you want scored, while the type = option specifies the type of prediction you want. Here we want the response option since we are predicting probabilities. If we wanted the predicted logit values we could use the option link instead.

The visreg function below creates the effect plot, which will show us the logistic curve across different values of the first continuous variable, here age. It will show each of the categorical variable categories as a separate curve. The other continuous variables are just held at a specified value - here their average value.
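The scoring chunk was not preserved; a sketch consistent with the output below (model and data reconstructed as before; the new observations match the SAS example above):

```r
# Assumed setup matching the earlier output
bwt <- within(MASS::birthwt, race <- c("white", "black", "other")[race])
logit.model <- glm(low ~ age + lwt + factor(smoke) + factor(race),
                   family = binomial(link = "logit"), data = bwt)

# Five new observations to score (values match the SAS example above)
newbw <- data.frame(age   = c(21, 40, 31, 28, 35),
                    lwt   = c(110, 120, 130, 140, 100),
                    race  = c("white", "black", "other", "white", "black"),
                    smoke = c(0, 0, 1, 1, 0))

# type = "response" returns predicted probabilities of the event (low = 1)
newbw$Pred <- predict(logit.model, newdata = newbw, type = "response")
newbw
```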

  age lwt  race smoke      Pred
1  21 110 white     0 0.1798424
2  40 120 black     0 0.3019376
3  31 130 other     1 0.5012475
4  28 140 white     1 0.2697100
5  35 100 black     0 0.3833902

The print function above looks at the scored data set. Notice that we added an additional column to our original data set we were scoring. This is the predicted probability of a 1. This is default in R. If you want the predicted probability of the 0 category, you can just subtract this prediction from 1.

Data Considerations

There are a lot of considerations one should take with data involving a logistic regression. What happens if my event of interest is a rare occurence in my data set? How do categorical variables influence the convergence of my logistic regression model? These are the important things we need to account for in modeling.

Rare Event Sampling

Classification models try to predict categorical target variables, typically where one of the categories is of main interest. What if this category of interest doesn’t happen frequently in the data set? This is referred to as rare-event modeling and exists in many fields such as fraud, credit default, marketing responses, and rare weather events. The typical cut-off people use to decide if their event is rare is 5%. If your event occurs less than 5% of the time, you should adjust your modeling to account for this.

We will be using the telecomm data set to model the association between various factors and a customer churning from the company. The variables in the data set are the following:

Telecomm Churn Data Dictionary

| Variable | Description |
|----------|-------------|
| account_length | length of time with company |
| international_plan | yes, no |
| voice_mail_plan | yes, no |
| customer_service_calls | number of service calls |
| total_day_minutes | minutes used during daytime |
| total_day_calls | calls made during daytime |
| total_day_charge | cost of usage during daytime |
| evening, night, and international variables | same as the previous three minutes/calls/charge variables, but for evening, night, and international usage |

When you have a rare event situation, you can adjust for this by balancing out the sample in your training data set. This can be done through sampling techniques, typically called oversampling. When oversampling, you can either over- or under-sample your target variable to get balance.

Oversampling is when you replicate your rare event enough times to balance out the non-events in your training data set. There are a variety of ways to replicate observations, but for now we will focus only on true replication - copying our actual observations at random until we get balance. This will inflate our training data set size.

Undersampling is when we randomly sample from the non-events in our training data set only enough observations to balance out with events. This will make our training data set size much smaller. These are both represented in the following figure:

Over vs. Under Sampling

Let’s see how to perform these in each of our softwares!

SAS

In SAS it is rather trivial to replicate observations with the DATA STEP, so we will not cover oversampling here. However, random undersampling isn’t as trivial. First, we examine the proportional breakdown of our target variable with PROC FREQ. Specifically, we want to view the churn variable. We can see below that roughly 5% of the customers in our data churn. With PROC SURVEYSELECT we have a few important options. The data = option inputs our data set of interest - here the whole data set. The out = option specifies the name of the data set that will contain the observations we sampled. The seed = option is important to specify so we can replicate the results and get the same sample repeatedly. The sampsize = option specifies how many observations we want in our undersampled data for each category (event and non-event). Here we want 100 in each, making our training data set 200 observations with a 50/50 balance of events to non-events. We don’t want to use all the event observations in the training set so that some remain for the validation/testing data set. The outall option is very important. It keeps all the observations from our original data set (from the data = option) in the new data set (from the out = option), with a new variable added to flag which observations have been sampled and which have not. Without this option we would only have the sampled observations in the new data set. Lastly, the STRATA statement specifies the variable we want to stratify our sample on - here the target variable.

proc freq data=logistic.tele_churn;
   table churn;
run;

proc surveyselect data = logistic.tele_churn noprint out=churn_over seed=12345 
                  sampsize=(100 100) outall; 
   strata churn; 
run;

proc freq data=churn_over;
   table churn*selected / norow nopercent;
run;
churn   Frequency   Percent   Cumulative Frequency   Cumulative Percent
FALSE   2850        94.87     2850                    94.87
TRUE     154         5.13     3004                   100.00



Table of churn by Selected (each cell: Frequency / Col Pct)

churn   Selected = 0   Selected = 1   Total
FALSE   2750 / 98.07   100 / 50.00    2850
TRUE      54 /  1.93   100 / 50.00     154
Total   2804           200            3004
After running PROC SURVEYSELECT, we can see from the PROC FREQ that we have a new variable called Selected in our data set. There are 200 observations with a value of 1 for Selected (our sampled observations), with a balance of 100 events and 100 non-events. The remaining observations have a value of 0 for the Selected variable. We will use the DATA STEP to split the selected and non-selected observations into our training and validation data sets respectively.

data churn_t churn_v;
   set churn_over;
   if selected = 1 then output churn_t;
   else output churn_v;
run;

We now have an undersampled data set to build models on. There are two typical ways to adjust models for over- or undersampling - adjusting the intercept and weighting the observations.

Let’s examine this data set and how we can adjust our models accordingly with each technique.

Adjusting the Intercept

One way of adjusting your logistic regression model to account for the sampling that balanced the events with non-events is by adjusting the intercept of the logistic regression model. The intercept is what sets the “baseline” probability in your model, which is now too high because events are over-represented in the sample. You can adjust your predicted probabilities from the regression model to account for this incorrect intercept. They can be adjusted with the following equation:

\[ \hat{p}_i = \frac{\hat{p}_i^* \rho_0 \pi_1}{(1-\hat{p}_i^*)\rho_1\pi_0 + \hat{p}_i^* \rho_0 \pi_1}\]

where \(\hat{p}_i^*\) is the unadjusted predicted value, \(\pi_1\) and \(\pi_0\) are the population proportions of your target variable categories (1 and 0), and \(\rho_1\) and \(\rho_0\) are the sample proportions of your target variable categories (1 and 0).
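To make the adjustment concrete, here is a small Python sketch of the equation above (the function name is ours; the proportions are the 50/50 balanced sample and the roughly 5.13% population churn rate from this example):

```python
def adjust_probability(p_star, rho1, rho0, pi1, pi0):
    """Map a predicted probability from the balanced sample back to the
    population scale, per the adjustment equation above."""
    num = p_star * rho0 * pi1
    den = (1 - p_star) * rho1 * pi0 + p_star * rho0 * pi1
    return num / den

# Sample is 50/50; population is ~5.13% events (154 of 3004 customers)
rho1, rho0 = 0.5, 0.5
pi1, pi0 = 154 / 3004, 2850 / 3004

# A 0.5 predicted probability on the balanced sample maps back to the
# population event rate, as it should.
print(adjust_probability(0.5, rho1, rho0, pi1, pi0))
```

Note the sanity checks: probabilities of exactly 0 or 1 are left unchanged, and a probability equal to the sample event rate maps back to the population event rate.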

Luckily, SAS will do this for us! All we have to do is give SAS the prior proportions from our population. In PROC FREQ we are again looking at the original distribution of the churn variable, but this time recording the proportions of 1’s and 0’s. The out = option saves the frequency analysis to a data set, here called priors. In this data set we drop the percent variable with the drop = option and rename the count variable to _prior_ with the rename = option.

From there we build our logistic regression in PROC LOGISTIC as we have previously learned in earlier sections. The main difference comes in the SCORE statement. The data = option specifies the data set we want to score (here the validation data set) and the out = option specifies the name of our newly scored data set. The prior = option specifies the name of the data set that contains the prior proportions for the target variable (the priors data set we created from PROC FREQ).

proc freq data=logistic.tele_churn noprint;
   table churn / out=priors(drop=percent rename=(count=_prior_));
run;
        
proc logistic data=churn_t;
   class international_plan(ref='no') voice_mail_plan(ref='no') / param=ref;
   model churn(event='TRUE') = international_plan
                               voice_mail_plan 
                               total_day_charge 
                               customer_service_calls / clodds=pl;
   score data=churn_v prior=priors out=churn_scored1;
run;
quit;
Model Information
Data Set WORK.CHURN_T
Response Variable churn
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring

Number of Observations Read 200
Number of Observations Used 200

Response Profile
Ordered Value   churn   Total Frequency
1               FALSE   100
2               TRUE    100

Probability modeled is churn=‘TRUE’.


Class Level Information
Class Value Design
Variables
international_plan no 0
yes 1
voice_mail_plan no 0
yes 1

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         279.259          220.494
SC          282.557          236.985
-2 Log L    277.259          210.494

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 66.7650 4 <.0001
Score 57.7493 4 <.0001
Wald 43.7669 4 <.0001

Type 3 Analysis of Effects
Effect                 DF   Wald Chi-Square   Pr > ChiSq
international_plan     1    20.3722           <.0001
voice_mail_plan        1     5.9184           0.0150
total_day_charge       1    23.4618           <.0001
customer_service_cal   1    17.8284           <.0001

Analysis of Maximum Likelihood Estimates
Parameter                DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept                1    -3.9167    0.7369           28.2530           <.0001
international_plan yes   1     2.5347    0.5616           20.3722           <.0001
voice_mail_plan yes      1    -1.0734    0.4412            5.9184           0.0150
total_day_charge         1     0.0875    0.0181           23.4618           <.0001
customer_service_cal     1     0.5081    0.1203           17.8284           <.0001

Association of Predicted Probabilities and
Observed Responses
Percent Concordant 81.7 Somers' D 0.634
Percent Discordant 18.3 Gamma 0.634
Percent Tied 0.0 Tau-a 0.319
Pairs 10000 c 0.817

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect Unit Estimate 95% Confidence Limits
international_plan yes vs no 1.0000 12.613 4.514 41.742
voice_mail_plan yes vs no 1.0000 0.342 0.138 0.791
total_day_charge 1.0000 1.091 1.055 1.133
customer_service_cal 1.0000 1.662 1.325 2.128

Plot of Odds Ratios with 95% Profile-Likelihood Confidence Limits


This will produce predicted probabilities for our validation data set that have been adjusted to account for the sampling we did to balance out the data set.


Weighted Observations

The other technique for adjusting the model for over- or undersampling is building the model with weighted observations, so the adjustment happens during model building instead of after the fact. In weighting the observations we use weighted MLE instead of plain MLE, since each observation no longer carries the same weight in the estimation of the parameters of our model. The only question is what the weights should be. We derive the weights from the same formula we had for the intercept adjustment. The weight for the rare event is 1, while the weight for the non-event is \(\rho_1\pi_0/\rho_0\pi_1\). For our data set this makes the weights 1 and 18.49 respectively.
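To see where 18.49 comes from, here is the arithmetic as a short Python sketch, using the rounded population proportions from the earlier frequency output:

```python
# Sample proportions after balancing (100 events, 100 non-events)
rho1, rho0 = 0.5, 0.5

# Population proportions of churn (rounded, from the frequency table)
pi1, pi0 = 0.0513, 0.9487

# Weight for each non-event observation; events keep weight 1
nonevent_weight = (rho1 * pi0) / (rho0 * pi1)
print(round(nonevent_weight, 2))  # 18.49
```

Because the sample is exactly 50/50, the \(\rho\) terms cancel and the weight reduces to the population odds of a non-event, \(\pi_0/\pi_1\).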

We use the DATA STEP to create a weights variable in our training data set with these weights. Next, we build the logistic regression model on our training data as we learned in the previous section, but with the new WEIGHT statement, in which we specify the weights variable so PROC LOGISTIC uses these weights in the weighted MLE.

data churn_t;
   set churn_t;
   weights = 1;
   if churn = 'FALSE' then weights = 18.49;
run;
proc logistic data=churn_t;
   class international_plan(ref='no') voice_mail_plan(ref='no') / param=ref;
   model churn(event='TRUE') = international_plan
                               voice_mail_plan 
                               total_day_charge 
                               customer_service_calls / clodds=pl;
   weight weights;
   score data=churn_v out=churn_scored2;
run;
quit;
Model Information
Data Set WORK.CHURN_T
Response Variable churn
Number of Response Levels 2
Weight Variable weights
Model binary logit
Optimization Technique Fisher's scoring

Number of Observations Read 200
Number of Observations Used 200
Sum of Weights Read 1949
Sum of Weights Used 1949

Response Profile
Ordered Value   churn   Total Frequency   Total Weight
1               FALSE   100               1849.0000
2               TRUE    100                100.0000

Probability modeled is churn=‘TRUE’.


Class Level Information
Class Value Design
Variables
international_plan no 0
yes 1
voice_mail_plan no 0
yes 1

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         790.759          655.776
SC          794.058          672.267
-2 Log L    788.759          645.776

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 142.9838 4 <.0001
Score 158.6835 4 <.0001
Wald 119.2408 4 <.0001

Type 3 Analysis of Effects
Effect                 DF   Wald Chi-Square   Pr > ChiSq
international_plan     1    65.5368           <.0001
voice_mail_plan        1    14.9381           0.0001
total_day_charge       1    53.4961           <.0001
customer_service_cal   1    34.9512           <.0001

Analysis of Maximum Likelihood Estimates
Parameter                DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept                1    -7.0274    0.5293           176.3076          <.0001
international_plan yes   1     2.2064    0.2725            65.5368          <.0001
voice_mail_plan yes      1    -1.2403    0.3209            14.9381          0.0001
total_day_charge         1     0.0980    0.0134            53.4961          <.0001
customer_service_cal     1     0.4681    0.0792            34.9512          <.0001

Association of Predicted Probabilities and
Observed Responses
Percent Concordant 81.3 Somers' D 0.626
Percent Discordant 18.7 Gamma 0.626
Percent Tied 0.0 Tau-a 0.314
Pairs 10000 c 0.813

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect Unit Estimate 95% Confidence Limits
international_plan yes vs no 1.0000 9.083 5.282 15.436
voice_mail_plan yes vs no 1.0000 0.289 0.148 0.525
total_day_charge 1.0000 1.103 1.075 1.133
customer_service_cal 1.0000 1.597 1.366 1.864

Plot of Odds Ratios with 95% Profile-Likelihood Confidence Limits


Now we can analyze our results above knowing that the predictions and model have already been adjusted for our sampling because of the weighted approach.

Which approach is better? The more common approach is the weighted observation approach. Empirical simulation studies have shown that for large sample sizes (n > 1000), the weighted approach is better. In small sample sizes (n < 1000), the adjusted intercept approach is only better when you correctly specify the model. In other words, if you plan on doing variable selection because you are unsure whether you have the correct variables in your model, then your model may be misspecified in its current form until after you perform variable selection. That being the case, it is probably safer to use the weighted approach. This is why most people use it.

R

In R we can use the ubOver and ubUnder functions to perform over- and undersampling respectively. First, we examine the proportional breakdown of our target variable with the table and prop.table functions. Specifically, we want to view the churn variable. We can see below that roughly 5% of the customers in our data churn. We then split our data into training and validation/testing. This is easy to do in R with the sample function. We sample from a sequence of numbers (seq_len function) that is the length of the number of rows in our data set (nrow function). The size of the sample (size = option) is going to be roughly 70% (or 0.7) of the number of rows. This provides us with a list of row IDs for our data. The training data is defined as the rows that have these row IDs and the validation data set is defined as the rows that don’t. The table functions then show the number of observations in each data set for the churn variable.
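The same 70/30 split can be sketched in language-neutral terms. Here is a hypothetical Python version (the seed and variable names are ours; only the row count comes from this data set):

```python
import random

random.seed(12345)  # any fixed seed makes the split reproducible

n_rows = 3004  # number of rows in the churn data
train_ids = set(random.sample(range(n_rows), k=int(0.7 * n_rows)))

# Training rows are the sampled IDs; validation rows are everything else
train = [i for i in range(n_rows) if i in train_ids]
valid = [i for i in range(n_rows) if i not in train_ids]
print(len(train), len(valid))  # 2102 902
```

The 2102/902 split matches the table and prop.table counts shown below.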


FALSE  TRUE 
 2850   154 

     FALSE       TRUE 
0.94873502 0.05126498 

FALSE  TRUE 
 1995   107 

FALSE  TRUE 
  855    47 

The hardest part about the ubOver and ubUnder functions is the syntax. The two main options in each function are X = and Y =, which represent a dataframe of the predictor variables and a vector of the target variable respectively. These are defined by column name to create the inputs and target objects. These pieces are then input into either the ubOver or ubUnder function depending on which technique you want to use. From there, we recombine the created input dataframe and target vector with the cbind function. After some brief renaming of the target variable, we can examine the training data sets for each approach.


FALSE  TRUE 
 1995  1995 

FALSE  TRUE 
  107   107 

We now have an undersampled data set to build models on. There are two typical ways to adjust models for over- or undersampling - adjusting the intercept and weighting the observations.

Let’s examine this data set and how we can adjust our models accordingly with each technique.

Adjusting the Intercept

One way of adjusting your logistic regression model to account for the sampling that balanced the events with non-events is by adjusting the intercept of the logistic regression model. The intercept is what sets the “baseline” probability in your model, which is now too high because events are over-represented in the sample. You can adjust your predicted probabilities from the regression model to account for this incorrect intercept. They can be adjusted with the following equation:

\[ \hat{p}_i = \frac{\hat{p}_i^* \rho_0 \pi_1}{(1-\hat{p}_i^*)\rho_1\pi_0 + \hat{p}_i^* \rho_0 \pi_1}\]

where \(\hat{p}_i^*\) is the unadjusted predicted value, \(\pi_1\) and \(\pi_0\) are the population proportions of your target variable categories (1 and 0), and \(\rho_1\) and \(\rho_0\) are the sample proportions of your target variable categories (1 and 0).

Luckily, this is rather easy to do in R, as we can directly adjust the predicted probability vector with this equation. First, we build a logistic regression model as seen in the previous section using the glm function on our undersampled training data. We can use the predict function to obtain predicted probabilities from this model. However, these probabilities are biased because of our sampling. By simply applying the equation above to the predicted probability vector we get unbiased predicted probabilities. We then store these as a new column in our validation data set.


Call:
glm(formula = churn ~ factor(international.plan) + factor(voice.mail.plan) + 
    total.day.charge + customer.service.calls, family = binomial(link = "logit"), 
    data = train_u)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0049  -0.7499  -0.0596   0.8002   2.1917  

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -4.71202    0.77589  -6.073 1.25e-09 ***
factor(international.plan)yes  2.91300    0.58964   4.940 7.80e-07 ***
factor(voice.mail.plan)yes    -0.30174    0.43242  -0.698    0.485    
total.day.charge               0.09769    0.01829   5.341 9.26e-08 ***
customer.service.calls         0.63437    0.12268   5.171 2.33e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 296.67  on 213  degrees of freedom
Residual deviance: 217.88  on 209  degrees of freedom
AIC: 227.88

Number of Fisher Scoring iterations: 4
          2           5           9          11          16          17 
0.009815246 0.488633041 0.019242082 0.010848222 0.071763867 0.015565721 

We then store these as a new column in our validation data set.

Weighted Observations

The other technique for adjusting the model for over- or undersampling is building the model with weighted observations, so the adjustment happens during model building instead of after the fact. In weighting the observations we use weighted MLE instead of plain MLE, since each observation no longer carries the same weight in the estimation of the parameters of our model. The only question is what the weights should be. We derive the weights from the same formula we had for the intercept adjustment. The weight for the rare event is 1, while the weight for the non-event is \(\rho_1\pi_0/\rho_0\pi_1\). For our data set this makes the weights 1 and 18.49 respectively.

This is easily done in R by creating a new weight variable in our training data with the ifelse function. If our churn variable takes the value TRUE then that observation gets a weight of 1, while the other observations get a weight of 18.49. From there we build the logistic regression model as we learned in the previous section using the glm function. The only new option is the weights = option, where you specify the vector of weights for the training data set so weighted MLE can be performed.


Call:
glm(formula = churn ~ factor(international.plan) + factor(voice.mail.plan) + 
    total.day.charge + customer.service.calls, family = binomial(link = "logit"), 
    data = train_u, weights = weight)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.6377  -0.9171   0.3496   2.0589   3.3592  

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -7.29928    0.52017 -14.033  < 2e-16 ***
factor(international.plan)yes  2.57544    0.28830   8.933  < 2e-16 ***
factor(voice.mail.plan)yes    -1.08471    0.33129  -3.274  0.00106 ** 
total.day.charge               0.10566    0.01363   7.754 8.90e-15 ***
customer.service.calls         0.42969    0.07811   5.501 3.78e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 843.97  on 213  degrees of freedom
Residual deviance: 681.96  on 209  degrees of freedom
AIC: 686.82

Number of Fisher Scoring iterations: 6
          2           5           9          11          16          17 
0.006357329 0.391666954 0.027787765 0.019310356 0.070133111 0.010490844 

Now we can analyze our results above knowing that the predictions and model have already been adjusted for our sampling because of the weighted approach. We use the predict function to obtain the predictions from our model and make them a new vector in our validation data set.

Which approach is better? The more common approach is the weighted observation approach. Empirical simulation studies have shown that for large sample sizes (n > 1000), the weighted approach is better. In small sample sizes (n < 1000), the adjusted intercept approach is only better when you correctly specify the model. In other words, if you plan on doing variable selection because you are unsure whether you have the correct variables in your model, then your model may be misspecified in its current form until after you perform variable selection. That being the case, it is probably safer to use the weighted approach. This is why most people use it.

Categorical Variables and Contrasts

Categorical variables can provide great value to any model. Specifically, we might be interested in the differences that might exist across categories in a logistic regression model with regard to our target variable. However, by default, logistic regression models only provide certain categorical comparisons based on the coding of our categorical variables. Through contrasts we can compare any two categories or combinations of categories that we desire.

Let’s see how to do this in each of our softwares!

SAS

SAS and PROC LOGISTIC can easily test all or any combination of categorical variable levels for inference on the target variable. The ODDSRATIO statement allows us to test all possible combinations of a specific categorical variable’s levels. Here we are looking at the customer_service_calls variable, which is being treated as a categorical variable. With the diff = all option, we test each combination of the levels of the variable for statistical differences.

ods html select OddsRatiosWald;
proc logistic data=churn_t;
   class international_plan(ref='no') voice_mail_plan(ref='no') 
         customer_service_calls(ref='0') / param=ref;
   model churn(event='TRUE') = international_plan
                               voice_mail_plan 
                               total_day_charge 
                               customer_service_calls / clodds=pl;
   weight weights;
   oddsratio customer_service_calls / diff=all;
run;
quit;
Odds Ratio Estimates and Wald Confidence Intervals
Odds Ratio Estimate 95% Confidence Limits
customer_service_calls 1 vs 0 1.189 0.617 2.293
customer_service_calls 2 vs 0 0.773 0.372 1.605
customer_service_calls 3 vs 0 1.511 0.694 3.292
customer_service_calls 4 vs 0 9.638 3.802 24.428
customer_service_calls 5 vs 0 8.570 3.191 23.015
customer_service_calls 6 vs 0 13.514 2.565 71.204
customer_service_calls 7 vs 0 >999.999 <0.001 >999.999
customer_service_calls 1 vs 2 1.539 0.793 2.987
customer_service_calls 1 vs 3 0.787 0.387 1.599
customer_service_calls 1 vs 4 0.123 0.051 0.298
customer_service_calls 1 vs 5 0.139 0.054 0.355
customer_service_calls 1 vs 6 0.088 0.017 0.452
customer_service_calls 1 vs 7 <0.001 <0.001 >999.999
customer_service_calls 2 vs 3 0.511 0.238 1.099
customer_service_calls 2 vs 4 0.080 0.032 0.200
customer_service_calls 2 vs 5 0.090 0.034 0.238
customer_service_calls 2 vs 6 0.057 0.011 0.310
customer_service_calls 2 vs 7 <0.001 <0.001 >999.999
customer_service_calls 3 vs 4 0.157 0.060 0.411
customer_service_calls 3 vs 5 0.176 0.064 0.486
customer_service_calls 3 vs 6 0.112 0.021 0.608
customer_service_calls 3 vs 7 <0.001 <0.001 >999.999
customer_service_calls 4 vs 5 1.125 0.372 3.401
customer_service_calls 4 vs 6 0.713 0.116 4.378
customer_service_calls 4 vs 7 <0.001 <0.001 >999.999
customer_service_calls 5 vs 6 0.634 0.101 3.983
customer_service_calls 5 vs 7 <0.001 <0.001 >999.999
customer_service_calls 6 vs 7 <0.001 <0.001 >999.999

In the table above we have the statistical tests of all 28 combinations of the customer_service_calls variable levels. These tell us whether there is a statistical difference between the two categories with respect to customers churning. You will notice some rather interesting odds ratios and confidence intervals for a couple of the tests. We will talk further about these issues in the convergence section below.

If instead you want to test only certain combinations of the levels rather than every combination, you can use the TEST statement in PROC LOGISTIC. In the first TEST statement below we are testing whether 1 customer service call is equal to 2 customer service calls in terms of likelihood of churning. The second TEST statement tests whether 1 customer service call is different from the average of the 4, 5, 6, and 7 customer service call categories with regard to churn. Note the names of these variables. That is the most difficult part of SAS TEST statements. It is probably best to first run the model to get the names that SAS uses for these variables in the Parameter Estimates output before running specific tests.
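Behind each TEST statement is a Wald chi-square test of a linear combination of the coefficients, \(c'\beta = 0\). Here is a minimal Python sketch of that computation (the coefficient vector and covariance matrix below are made-up placeholders, not values fitted from this model):

```python
import math

def wald_test(c, beta, cov):
    """Wald chi-square test (1 df) of the linear hypothesis c' beta = 0."""
    estimate = sum(ci * bi for ci, bi in zip(c, beta))
    # Variance of the linear combination: c' V c
    variance = sum(c[i] * cov[i][j] * c[j]
                   for i in range(len(c)) for j in range(len(c)))
    chi_sq = estimate ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_sq / 2))  # chi-square(1) survival function
    return chi_sq, p_value

# Placeholder two-coefficient example: test beta1 = beta2 via c = (1, -1)
beta = [0.17, -0.26]
cov = [[0.09, 0.02],
       [0.02, 0.12]]
chi_sq, p = wald_test([1.0, -1.0], beta, cov)
```

The second TEST statement above is the same computation with contrast vector \(c = (1, -0.25, -0.25, -0.25, -0.25)\) over the relevant coefficients.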

ods html select TestStmts;
proc logistic data=churn_t;
   class international_plan(ref='no') voice_mail_plan(ref='no') 
         customer_service_calls(ref='0') / param=ref;
   model churn(event='TRUE') = international_plan
                               voice_mail_plan 
                               total_day_charge 
                               customer_service_calls / clodds=pl;
   weight weights;
   test customer_service_cal1 = customer_service_cal2;
   test customer_service_cal1 = 0.25*customer_service_cal4 + 
                                  0.25*customer_service_cal5 +
                                  0.25*customer_service_cal6 +
                                  0.25*customer_service_cal7;
run;
quit;
Linear Hypotheses Testing Results
Label    Wald Chi-Square   DF   Pr > ChiSq
Test 1   1.6260            1    0.2023
Test 2   0.0001            1    0.9919


As you can see from the results above, there are no statistical differences in the category combinations we were testing with the TEST statements.

R

R and the glht function can easily test all combinations of categorical variable levels for inference on the target variable. First we need the variables in a workable format: we need to make the categorical variables factors outside the glm function. This new factor representation of customer_service_calls is the fcsc variable, created with the factor function. Then we fit our model as usual with the glm function. With the glht function, we test each combination of the levels of the variable for statistical differences. The linfct option specifies the linear function we are trying to test. This option can be used to test all combinations, as we do below, or individual combinations (like the TEST statement in SAS, though not shown here). With the mcp function we specify that we want multiple comparisons with the Tukey adjustment on the fcsc variable.


Call:
glm(formula = churn ~ factor(international.plan) + factor(voice.mail.plan) + 
    total.day.charge + fcsc, family = binomial(link = "logit"), 
    data = train_u, weights = weight)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.2161  -0.8162  -0.0713   1.9952   3.5811  

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)    
(Intercept)                    -6.96883    0.60632 -11.494  < 2e-16 ***
factor(international.plan)yes   2.58132    0.30755   8.393  < 2e-16 ***
factor(voice.mail.plan)yes     -1.29496    0.35846  -3.613 0.000303 ***
total.day.charge                0.11776    0.01503   7.837 4.62e-15 ***
fcsc1                          -0.44611    0.33399  -1.336 0.181641    
fcsc2                          -0.24862    0.34916  -0.712 0.476446    
fcsc3                           0.13420    0.39318   0.341 0.732856    
fcsc4                           0.86367    0.37282   2.317 0.020526 *  
fcsc5                           2.31434    0.50692   4.565 4.98e-06 ***
fcsc6                          20.79486  685.24089   0.030 0.975790    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 843.97  on 213  degrees of freedom
Residual deviance: 639.00  on 204  degrees of freedom
AIC: 654.22

Number of Fisher Scoring iterations: 14

     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: glm(formula = churn ~ factor(international.plan) + factor(voice.mail.plan) + 
    total.day.charge + fcsc, family = binomial(link = "logit"), 
    data = train_u, weights = weight)

Linear Hypotheses:
           Estimate Std. Error z value Pr(>|z|)    
1 - 0 == 0  -0.4461     0.3340  -1.336  0.80288    
2 - 0 == 0  -0.2486     0.3492  -0.712  0.98943    
3 - 0 == 0   0.1342     0.3932   0.341  0.99983    
4 - 0 == 0   0.8637     0.3728   2.317  0.19875    
5 - 0 == 0   2.3143     0.5069   4.565  < 0.001 ***
6 - 0 == 0  20.7949   685.2409   0.030  1.00000    
2 - 1 == 0   0.1975     0.3403   0.580  0.99651    
3 - 1 == 0   0.5803     0.3812   1.522  0.68795    
4 - 1 == 0   1.3098     0.3749   3.493  0.00646 ** 
5 - 1 == 0   2.7605     0.4961   5.565  < 0.001 ***
6 - 1 == 0  21.2410   685.2409   0.031  1.00000    
3 - 2 == 0   0.3828     0.3935   0.973  0.94922    
4 - 2 == 0   1.1123     0.3947   2.818  0.05696 .  
5 - 2 == 0   2.5630     0.5047   5.078  < 0.001 ***
6 - 2 == 0  21.0435   685.2409   0.031  1.00000    
4 - 3 == 0   0.7295     0.4314   1.691  0.57214    
5 - 3 == 0   2.1801     0.5309   4.107  < 0.001 ***
6 - 3 == 0  20.6607   685.2409   0.030  1.00000    
5 - 4 == 0   1.4507     0.5434   2.669  0.08539 .  
6 - 4 == 0  19.9312   685.2409   0.029  1.00000    
6 - 5 == 0  18.4805   685.2410   0.027  1.00000    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

In the table above we have the statistical tests of 21 combinations of the customer_service_calls variable levels. Notice how there is no test containing 7 calls as we had in SAS. This is because our training data set in R didn’t have 7 customer service calls as a possible level. These tests tell us whether there is a statistical difference between the two categories with respect to customers churning. You will notice some rather interesting estimates and standard errors for a couple of the tests. We will talk further about these issues in the convergence section below.

Convergence Problems

One of the downsides of maximum likelihood estimation in logistic regression is that there is no closed-form solution. An algorithm must iterate to find the point that maximizes the logistic regression likelihood instead of just calculating a known answer as OLS does in linear regression. This also means the algorithm might not converge.

Categorical variables are the typical culprit for convergence problems (though, rarely, continuous variables can cause them as well). This lack of convergence from categorical variables stems from complete separation or quasi-complete separation. Complete separation occurs when some combination of the predictors perfectly predicts every outcome, as you can see in the table below:

Variable Yes No Logit
Group A 100 0 Infinity
Group B 0 50 Negative Infinity

Quasi-complete separation occurs when the outcome can be perfectly predicted for only a subset of the data as shown in the table below:

Variable Yes No Logit
Group A 77 23 1.21
Group B 0 50 Negative Infinity

The reason this poses a problem is that the likelihood function doesn’t actually have a maximum.

Convergence Problems


Remember that the logistic function is completely bounded between 0 and 1. Since these categories perfectly predict the target variable, we would need a predicted probability of exactly 0 or 1, which cannot be obtained in logistic regression without infinitely large (or small) parameter estimates.
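The infinite logits in the separation tables can be made concrete with a short Python sketch (the helper name `empirical_logit` is ours):

```python
import math

def empirical_logit(events, nonevents):
    """Log odds for a cell; infinite when either count is zero (separation)."""
    if nonevents == 0:
        return math.inf
    if events == 0:
        return -math.inf
    return math.log(events / nonevents)

# Complete separation: both groups give infinite estimates
print(empirical_logit(100, 0), empirical_logit(0, 50))

# Quasi-complete separation: only one group is degenerate
print(empirical_logit(77, 23), empirical_logit(0, 50))
```

Any cell with a zero count pushes the corresponding coefficient toward plus or minus infinity during estimation, which is exactly the non-convergence described above.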

Let’s see if this is a problem in our data set (hint… it is!) and how to address this problem in our softwares!

SAS

Ideally, we should explore all of our categorical variables before we ever model so separation issues shouldn’t be a surprise. Let’s examine our customer_service_calls variable using PROC FREQ.

proc freq data=churn_t;
   table customer_service_calls*churn;
run;
Table of customer_service_calls by churn
(each cell: Frequency / Percent / Row Pct / Col Pct)

calls   FALSE                        TRUE                          Total
0       23 / 11.50 / 53.49 / 23.00   20 / 10.00 / 46.51 / 20.00    43 (21.50)
1       32 / 16.00 / 56.14 / 32.00   25 / 12.50 / 43.86 / 25.00    57 (28.50)
2       29 / 14.50 / 64.44 / 29.00   16 /  8.00 / 35.56 / 16.00    45 (22.50)
3       13 /  6.50 / 50.00 / 13.00   13 /  6.50 / 50.00 / 13.00    26 (13.00)
4        1 /  0.50 /  7.14 /  1.00   13 /  6.50 / 92.86 / 13.00    14 (7.00)
5        1 /  0.50 /  9.09 /  1.00   10 /  5.00 / 90.91 / 10.00    11 (5.50)
6        1 /  0.50 / 33.33 /  1.00    2 /  1.00 / 66.67 /  2.00     3 (1.50)
7        0 /  0.00 /  0.00 /  0.00    1 /  0.50 / 100.00 / 1.00     1 (0.50)
Total   100 (50.00)                  100 (50.00)                   200 (100.00)

We can see that the category of 7 customer service calls predicts churn perfectly - the 1 person in it churned while 0 did not. Every other category has at least one person who churned and one who didn’t, so this is an example of quasi-complete separation.

In case you don’t explore your data ahead of time, you will notice interesting results in categories with separation issues. Let’s examine the parameter estimates for 6 and 7 customer service calls.

| Parameter               | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|-------------------------|----|----------|----------------|-----------------|------------|
| customer_service_cal 6  | 1  | 2.9827   | 2.2991         | 1.6831          | 0.1945     |
| customer_service_cal 7  | 1  | 18.4993  | 4416.3         | 0               | 0.9967     |

Notice the category of 7 compared to that of 6. Its parameter estimate is rather high (remember that odds ratios are exponentiated parameter estimates, so 18.4993 is REALLY large). Also notice the inflated standard error in the thousands as compared to single digits for the 6 category. This leads to a test statistic essentially equal to 0 with a really high p-value. These are definite signs of convergence problems!

One possible solution to separation problems in logistic regression is Firth's penalized likelihood, which the firth option in PROC LOGISTIC can perform. Exact logistic regression (the EXACT statement) is another possibility, but it is computationally much more expensive and should really only be used if you need an estimate for the specific category that is causing the separation issues. The code is not shown here.

A more common solution to separation concerns is to transform the categorical variable. Instead of having a category with separation issues, you can combine this category with another category to remove the separation concerns. Below we are creating a “4+” category that contains the categories of 4, 5, 6, and 7. This will remove the separation issues as you can see as we rerun the logistic regression.

data churn_t;
   set churn_t;
   customer_service_calls_c = put(customer_service_calls, 2.);
   if customer_service_calls > 3 then customer_service_calls_c = '4+';
run;

ods html select ParameterEstimates;
proc logistic data=churn_t;
   class international_plan(ref='no') voice_mail_plan(ref='no') 
         customer_service_calls_c(ref='0') / param=ref;
   model churn(event='TRUE') = international_plan
                               voice_mail_plan 
                               total_day_charge 
                               customer_service_calls_c / clodds=pl;
   weight weights;
run;
quit;
Analysis of Maximum Likelihood Estimates

| Parameter                | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|--------------------------|----|----------|----------------|-----------------|------------|
| Intercept                | 1  | -5.9939  | 0.5611         | 114.1161        | <.0001     |
| international_plan yes   | 1  | 2.0797   | 0.2760         | 56.7805         | <.0001     |
| voice_mail_plan yes      | 1  | -1.3166  | 0.3316         | 15.7634         | <.0001     |
| total_day_charge         | 1  | 0.0843   | 0.0134         | 39.3570         | <.0001     |
| customer_service_cal 1   | 1  | 0.1656   | 0.3335         | 0.2467          | 0.6194     |
| customer_service_cal 2   | 1  | -0.2577  | 0.3716         | 0.4809          | 0.4880     |
| customer_service_cal 3   | 1  | 0.4038   | 0.3959         | 1.0403          | 0.3078     |
| customer_service_cal 4+  | 1  | 2.3213   | 0.3773         | 37.8501         | <.0001     |

Categories of ordinal variables are easy to combine because you can merge a separation-causing category with the category on either side of it. With nominal variables, however, we typically combine the separation-causing category with the category whose relationship with the target variable is most similar.

R

Ideally, we should explore all of our categorical variables before we ever model so separation issues shouldn’t be a surprise. Let’s examine our customer_service_calls variable using the table function.
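The table below was produced by a call like the following sketch. Since the train_u training data from the earlier data-prep sections isn't recreated here, this stand-in rebuilds it from the counts shown, just to make the call concrete:

```r
# Stand-in for train_u, reconstructed from the frequencies in the table below
train_u <- data.frame(
  customer.service.calls = rep(0:6, times = c(40, 67, 42, 26, 26, 9, 4)),
  churn = unlist(mapply(function(f, t) c(rep(FALSE, f), rep(TRUE, t)),
                        c(17, 46, 24, 14, 5, 1, 0),
                        c(23, 21, 18, 12, 21, 8, 4)))
)
# Cross-tabulate the categorical predictor against the target
table(train_u$customer.service.calls, train_u$churn)
```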

   
    FALSE TRUE
  0    17   23
  1    46   21
  2    24   18
  3    14   12
  4     5   21
  5     1    8
  6     0    4

We can see that the category of 6 customer service calls predicts churn perfectly - the 4 people in it churned while 0 did not. Every other category has at least one person who churned and one who didn’t, so this is an example of quasi-complete separation.

In case you don’t explore your data ahead of time, you will notice interesting results in categories with separation issues. Let’s examine the parameter estimates for 5 and 6 customer service calls.


Call:
brglm(formula = churn ~ factor(international.plan) + factor(voice.mail.plan) + 
    total.day.charge + factor(customer.service.calls), family = binomial(link = "logit"), 
    data = train_u, weights = weight)


Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -6.87405    0.59581 -11.537  < 2e-16 ***
factor(international.plan)yes    2.55076    0.30400   8.391  < 2e-16 ***
factor(voice.mail.plan)yes      -1.25465    0.34895  -3.595 0.000324 ***
total.day.charge                 0.11608    0.01478   7.853 4.05e-15 ***
factor(customer.service.calls)1 -0.44469    0.32884  -1.352 0.176272    
factor(customer.service.calls)2 -0.24499    0.34398  -0.712 0.476330    
factor(customer.service.calls)3  0.14091    0.38606   0.365 0.715116    
factor(customer.service.calls)4  0.85071    0.36869   2.307 0.021034 *  
factor(customer.service.calls)5  2.30700    0.50256   4.590 4.42e-06 ***
factor(customer.service.calls)6  7.46866    1.77541   4.207 2.59e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 812.97  on 213  degrees of freedom
Residual deviance: 639.95  on 204  degrees of freedom
Penalized deviance: 612.5096 
AIC:  655 

Notice the category of 6 compared to that of 5. Its parameter estimate is rather high (remember that odds ratios are exponentiated parameter estimates so 7.46866 is REALLY large). These are definite signs of convergence problems!

One possible solution to separation problems in logistic regression is Firth's penalized likelihood, which the logistf package in R can perform (the brglm fit above uses a similar bias-reduction approach). Exact logistic regression is another possibility, but it is computationally much more expensive and should really only be used if you need an estimate for the specific category that is causing the separation issues. It is not shown here.

A more common solution to separation concerns is to transform the categorical variable. Instead of having a category with separation issues, you can combine this category with another category to remove the separation concerns. Below we are creating a “4+” category that contains the categories of 4, 5, and 6. This will remove the separation issues as you can see as we rerun the logistic regression.
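A minimal sketch of the collapsing step; `calls` here is a stand-in for train_u$customer.service.calls from the earlier sections:

```r
# Collapse the sparse upper levels into a single "4+" category
calls <- 0:6
calls.c <- ifelse(calls > 3, "4+", as.character(calls))
calls.c  # → "0" "1" "2" "3" "4+" "4+" "4+"
```

The model is then refit with the collapsed variable (as factor(customer.service.calls.c)) in place of the raw counts, which is what produces the output below.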

    
     FALSE TRUE
  0     17   23
  1     46   21
  2     24   18
  3     14   12
  4+     6   33

Call:
glm(formula = churn ~ factor(international.plan) + factor(voice.mail.plan) + 
    total.day.charge + factor(customer.service.calls.c), family = binomial(link = "logit"), 
    data = train_u, weights = weight)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-5.6998  -0.8930   0.2868   2.0995   3.4518  

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                        -6.17765    0.56688 -10.898  < 2e-16 ***
factor(international.plan)yes       2.29116    0.29509   7.764 8.21e-15 ***
factor(voice.mail.plan)yes         -1.22091    0.33861  -3.606 0.000311 ***
total.day.charge                    0.09754    0.01427   6.836 8.12e-12 ***
factor(customer.service.calls.c)1  -0.50788    0.32827  -1.547 0.121829    
factor(customer.service.calls.c)2  -0.29729    0.34250  -0.868 0.385393    
factor(customer.service.calls.c)3   0.04459    0.38710   0.115 0.908285    
factor(customer.service.calls.c)4+  1.38866    0.32244   4.307 1.66e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 843.97  on 213  degrees of freedom
Residual deviance: 674.89  on 206  degrees of freedom
AIC: 685.74

Number of Fisher Scoring iterations: 6

Categories of ordinal variables are easy to combine because you can merge a separation-causing category with the category on either side of it. With nominal variables, however, we typically combine the separation-causing category with the category whose relationship with the target variable is most similar.

Subset Selection and Diagnostics

Which variables should you drop from your model? This is a common question for all modeling, but especially logistic regression. In this section we will cover a popular variable selection technique - stepwise selection. This isn’t the only possible technique, but it will be the primary focus here.

We will be using the birthweight data set to model the association between various factors and a child being born with low birth weight (<2.5kg). The variables in the data set are the following:

Birth Weight Data Dictionary
Variable Description
age mother’s age (years)
lwt mother’s weight at last menstrual period (lbs)
smoke mother’s smoking status during pregnancy
race mother’s race (1 = White, 2 = Black, 3 = Other)
ptl number of premature labors
ht history of hypertension
ui uterine irritability
ftv number of physician visits during first trimester

Stepwise Selection

Stepwise selection techniques involve the three common methods:

  1. Forward Selection
  2. Backward Selection
  3. Stepwise Selection

These techniques add or remove (depending on the technique) one variable at a time from your regression model to try and “improve” the model. There are a variety of different selection criteria to use to add or remove variables from a logistic regression. The software packages we explore use different approaches, so let’s examine each one!

SAS

PROC LOGISTIC uses significance levels and p-values for adding or removing variables in the stepwise selection techniques. P-values are falling out of popularity primarily because people often use the 0.05 significance level without any regard to sample size. Although 0.05 is a reasonable significance level for a sample size around 50, the level should be adjusted as the sample size changes. The mathematical details are not covered here, but it can be shown that using the BIC criterion for adding or removing variables in stepwise selection (an increasingly popular approach) is equivalent to using p-values in likelihood ratio tests, where the significance level changes with sample size through the BIC equation - exactly what is recommended for any selection technique using p-values.

Stepwise

Here we will work through stepwise selection. In stepwise selection, the initial model is empty (contains no variables, only the intercept). Each variable is then tried to see if it is significant based on p-value and a specified significance level. The most significant variable (if any) is then added to the model. Next, all variables in the model (here only one) are tested to see if they are still significant. If not, they are dropped. If so, then the remaining variables are again tested and the next most significant variable is added to the model. This process repeats until either no more significant variables are available to add or the same variable keeps being added and then removed from the model.

PROC LOGISTIC uses the selection = stepwise option to perform this technique. Notice also the slentry = and slstay = options where we specify the significance levels required to enter and to stay in the model, respectively. Here we chose 0.03 based on our sample size.

ods html select ModelBuildingSummary ParameterEstimates ORPlot;
proc logistic data=logistic.lowbwt plots(only)=(oddsratio);
    class race(ref='white') / param=ref; 
    model low(event='1') = age race lwt smoke / selection=stepwise slentry=0.03 slstay=0.03 clodds=pl clparm=pl;
    title 'Modeling Low Birth Weight';
run;
quit;
Modeling Low Birth Weight

Summary of Stepwise Selection

| Step | Effect Entered | Effect Removed | DF | Number In | Score Chi-Square | Wald Chi-Square | Pr > ChiSq |
|------|----------------|----------------|----|-----------|------------------|-----------------|------------|
| 1    | lwt            |                | 1  | 1         | 5.4382           |                 | 0.0197     |

Analysis of Maximum Likelihood Estimates

| Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|-----------|----|----------|----------------|-----------------|------------|
| Intercept | 1  | 0.9983   | 0.7853         | 1.6161          | 0.2036     |
| lwt       | 1  | -0.0141  | 0.00617        | 5.1921          | 0.0227     |

Plot of Odds Ratios with 95% Profile-Likelihood Confidence Limits


The model building summary shows that only one variable was added - lwt. This may differ across the different stepwise techniques.

Backward

Here we will work through backward selection. In backward selection, the initial model is full (contains all variables, including the intercept). Each variable is then tried to see if it is significant based on p-value and a specified significance level. The least significant variable (if any) is then dropped from the model. Then the remaining variables are again tested and the next least significant variable is dropped from the model. This process repeats until no more insignificant variables are available to drop.

PROC LOGISTIC uses the selection = backward option to perform this technique. Notice also the slstay = option where we specify the significance to stay in the model. Here we chose 0.03 based on our sample size.

ods html select ModelBuildingSummary ParameterEstimates ORPlot;
proc logistic data=logistic.lowbwt plots(only)=(oddsratio);
    class race(ref='white') / param=ref; 
    model low(event='1') = age race lwt smoke / selection=backward slstay=0.03 clodds=pl clparm=pl;
    title 'Modeling Low Birth Weight';
run;
quit;
Modeling Low Birth Weight

Summary of Backward Elimination

| Step | Effect Removed | DF | Number In | Wald Chi-Square | Pr > ChiSq |
|------|----------------|----|-----------|-----------------|------------|
| 1    | age            | 1  | 3         | 0.4326          | 0.5107     |
| 2    | lwt            | 1  | 2         | 4.4149          | 0.0356     |

Analysis of Maximum Likelihood Estimates

| Parameter  | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|------------|----|----------|----------------|-----------------|------------|
| Intercept  | 1  | -1.8405  | 0.3529         | 27.2065         | <.0001     |
| race black | 1  | 1.0841   | 0.4900         | 4.8951          | 0.0269     |
| race other | 1  | 1.1086   | 0.4003         | 7.6689          | 0.0056     |
| smoke      | 1  | 1.1160   | 0.3692         | 9.1357          | 0.0025     |

Plot of Odds Ratios with 95% Profile-Likelihood Confidence Limits


The model building summary shows that two variables were dropped - age and then lwt. This leaves race and smoke in our model as shown in the parameter estimates table. This may differ across the different stepwise techniques.

Forward with Interactions

Here we will work through forward selection. In forward selection, the initial model is empty (contains no variables, only the intercept). Each variable is then tried to see if it is significant based on p-value and a specified significance level. The most significant variable (if any) is then added to the model. Then the remaining variables are again tested and the next most significant variable is added to the model. This process repeats until no more significant variables are available to add.

PROC LOGISTIC uses the selection = forward option to perform this technique. Notice also the slentry = option where we specify the significance level required to enter the model. Here we chose 0.05.

Forward selection is the least used technique because stepwise selection does the same thing with the added benefit of dropping insignificant variables. The main use for forward selection is to test higher order terms and interactions in models. Here we are testing not only the variables, but their two-way interactions. This can be done in PROC LOGISTIC by placing a “|” between variables. The “@2” specifies that we want only two-way interactions between these variables. The include = option forces SAS to keep the first four effects - our main effects - in the model. This forces SAS to check all two-way interactions one at a time without having all of them in the model from the beginning, as backward selection with interactions would.

ods html select ModelBuildingSummary ParameterEstimates ORPlot;
proc logistic data=logistic.lowbwt plots(only)=(oddsratio);
    class race(ref='white') / param=ref; 
    model low(event='1') = age race lwt smoke age|race|lwt|smoke @2 / selection=forward slentry=0.05 clodds=pl clparm=pl include=4;
    title 'Modeling Low Birth Weight';
run;
quit;
Modeling Low Birth Weight


Analysis of Maximum Likelihood Estimates

| Parameter  | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|------------|----|----------|----------------|-----------------|------------|
| Intercept  | 1  | 0.3324   | 1.1077         | 0.0900          | 0.7641     |
| age        | 1  | -0.0225  | 0.0342         | 0.4326          | 0.5107     |
| race black | 1  | 1.2316   | 0.5171         | 5.6718          | 0.0172     |
| race other | 1  | 0.9432   | 0.4162         | 5.1351          | 0.0234     |
| lwt        | 1  | -0.0125  | 0.00639        | 3.8470          | 0.0498     |
| smoke      | 1  | 1.0544   | 0.3800         | 7.6991          | 0.0055     |

Plot of Odds Ratios with 95% Profile-Likelihood Confidence Limits


Notice that the model building summary is empty, as no interaction was found significant at our significance level. The model then remains at only the main effects, as we told SAS to include these. From here we can use the other two techniques to select down further.

R

R’s step function uses either AIC or BIC for adding or removing variables in the stepwise selection techniques. P-values are falling out of popularity primarily because people often use the 0.05 significance level without any regard to sample size. Although 0.05 is a reasonable significance level for a sample size around 50, the level should be adjusted as the sample size changes. The mathematical details are not covered here, but it can be shown that using the BIC criterion for adding or removing variables in stepwise selection (an increasingly popular approach) is equivalent to using p-values in likelihood ratio tests, where the significance level changes with sample size through the BIC equation - exactly what is recommended for any selection technique using p-values.

Stepwise

Here we will work through stepwise selection. In stepwise selection, the initial model is empty (contains no variables, only the intercept). Each candidate variable is then tried to see whether adding it improves the selection criterion (here, AIC). The best variable (if any) is added to the model. Next, all variables in the model (here only one) are re-checked; if dropping one would improve the criterion, it is removed. Otherwise the remaining candidates are tested again and the next best variable is added. This process repeats until no addition or removal improves the model any further.

In R we must specify two models - the empty and full models - to step between. Notice the full model contains all the variables, while the empty model contains only the intercept (signified by ~ 1 in the formula). We then use the step function where we start at the empty model. The scope option tells R the lower (lower =) and upper (upper =) model to step between. The direction = "both" option tells R to use stepwise selection.
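The trace below comes from a call like the following sketch. The document's bwt data set isn't recreated here; it appears to correspond to MASS::birthwt (n = 189), though the document's version recodes smoke and race as labeled factors, so coefficient labels will differ slightly:

```r
library(MASS)
bwt <- birthwt  # stand-in for the document's bwt data

# Empty (intercept-only) and full models to step between
empty.model <- glm(low ~ 1, data = bwt, family = binomial(link = "logit"))
full.model  <- glm(low ~ age + lwt + factor(smoke) + factor(race),
                   data = bwt, family = binomial(link = "logit"))

# Stepwise selection: start empty, allow both additions and removals
step.model <- step(empty.model,
                   scope = list(lower = empty.model, upper = full.model),
                   direction = "both")
```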

Start:  AIC=236.67
low ~ 1

                Df Deviance    AIC
+ lwt            1   228.69 232.69
+ factor(smoke)  1   229.81 233.81
+ factor(race)   2   229.66 235.66
+ age            1   231.91 235.91
<none>               234.67 236.67

Step:  AIC=232.69
low ~ lwt

                Df Deviance    AIC
+ factor(smoke)  1   224.34 230.34
+ factor(race)   2   223.26 231.26
<none>               228.69 232.69
+ age            1   227.12 233.12
- lwt            1   234.67 236.67

Step:  AIC=230.34
low ~ lwt + factor(smoke)

                Df Deviance    AIC
+ factor(race)   2   215.01 225.01
<none>               224.34 230.34
+ age            1   222.88 230.88
- factor(smoke)  1   228.69 232.69
- lwt            1   229.81 233.81

Step:  AIC=225.01
low ~ lwt + factor(smoke) + factor(race)

                Df Deviance    AIC
<none>               215.01 225.01
+ age            1   214.58 226.58
- lwt            1   219.97 227.97
- factor(race)   2   224.34 230.34
- factor(smoke)  1   223.26 231.26

Call:
glm(formula = low ~ lwt + factor(smoke) + factor(race), family = binomial(link = "logit"), 
    data = bwt)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5278  -0.9053  -0.5863   1.2878   2.0364  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)        1.18087    1.00983   1.169  0.24225   
lwt               -0.01326    0.00631  -2.101  0.03562 * 
factor(smoke)1     1.06001    0.37832   2.802  0.00508 **
factor(race)other -0.31958    0.52560  -0.608  0.54317   
factor(race)white -1.29009    0.51087  -2.525  0.01156 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 215.01  on 184  degrees of freedom
AIC: 225.01

Number of Fisher Scoring iterations: 4

The model building summary shows that three variables were added - lwt, smoke, and race - based on lowest AIC. This may differ across the different stepwise techniques.

Backward

Here we will work through backward selection. In backward selection, the initial model is full (contains all variables, including the intercept). Each variable is then tried to see whether removing it improves the selection criterion (here, AIC). The variable whose removal improves the criterion the most (if any) is dropped from the model. The remaining variables are then tested again and the next variable is dropped. This process repeats until no removal improves the model any further.

In R we must specify one model - the full model - to step backward from. Notice the full model contains all the variables. We then use the step function where we start at the full model. The direction = "backward" option tells R to use backward selection.
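The trace below comes from a call like the following sketch (again assuming bwt corresponds to MASS::birthwt, with possibly different factor codings):

```r
library(MASS)
bwt <- birthwt  # stand-in for the document's bwt data

# Full model containing all candidate variables
full.model <- glm(low ~ age + lwt + factor(smoke) + factor(race),
                  data = bwt, family = binomial(link = "logit"))

# Backward elimination: start full, only remove variables
back.model <- step(full.model, direction = "backward")
```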

Start:  AIC=226.58
low ~ age + lwt + factor(smoke) + factor(race)

                Df Deviance    AIC
- age            1   215.01 225.01
<none>               214.58 226.58
- lwt            1   218.86 228.86
- factor(race)   2   222.88 230.88
- factor(smoke)  1   222.66 232.66

Step:  AIC=225.01
low ~ lwt + factor(smoke) + factor(race)

                Df Deviance    AIC
<none>               215.01 225.01
- lwt            1   219.97 227.97
- factor(race)   2   224.34 230.34
- factor(smoke)  1   223.26 231.26

Call:
glm(formula = low ~ lwt + factor(smoke) + factor(race), family = binomial(link = "logit"), 
    data = bwt)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5278  -0.9053  -0.5863   1.2878   2.0364  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)        1.18087    1.00983   1.169  0.24225   
lwt               -0.01326    0.00631  -2.101  0.03562 * 
factor(smoke)1     1.06001    0.37832   2.802  0.00508 **
factor(race)other -0.31958    0.52560  -0.608  0.54317   
factor(race)white -1.29009    0.51087  -2.525  0.01156 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 215.01  on 184  degrees of freedom
AIC: 225.01

Number of Fisher Scoring iterations: 4

The model building summary shows that one variable was dropped - age - leaving lwt, smoke, and race in the model based on lowest AIC. This may differ across the different stepwise techniques.

Forward with Interactions

Here we will work through forward selection. In forward selection, the initial model is empty (contains no variables, only the intercept). Each candidate variable is then tried to see whether adding it improves the selection criterion (here, AIC). The best variable (if any) is added to the model. The remaining candidates are then tested again and the next best variable is added. This process repeats until no addition improves the model any further.

In R we must specify two models to step between. We then use the step function where we start at the lower model. The scope option tells R the lower (lower =) and upper (upper =) model to step between. The direction = "forward" option tells R to use forward selection.

Forward selection is the least used technique because stepwise selection does the same thing with the added benefit of dropping insignificant variables. The main use for forward selection is to test higher order terms and interactions in models. Here we are testing not only the variables, but their two-way interactions. This can be done in R by creating two models - the main effects model and the two-way interaction model. Here we set the starting point for forward selection at the main effects model and step up toward the interaction model.
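The trace below comes from a call like this sketch (assuming bwt corresponds to MASS::birthwt; factor codings may differ). The ^2 in the upper formula expands to all main effects plus their two-way interactions:

```r
library(MASS)
bwt <- birthwt  # stand-in for the document's bwt data

# Lower bound: main effects only; upper bound: adds all two-way interactions
main.model <- glm(low ~ age + lwt + factor(smoke) + factor(race),
                  data = bwt, family = binomial(link = "logit"))
int.model  <- glm(low ~ (age + lwt + factor(smoke) + factor(race))^2,
                  data = bwt, family = binomial(link = "logit"))

# Forward selection from the main effects model toward the interaction model
fwd.model <- step(main.model,
                  scope = list(lower = main.model, upper = int.model),
                  direction = "forward")
```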

Start:  AIC=226.58
low ~ age + lwt + factor(smoke) + factor(race)

                             Df Deviance    AIC
<none>                            214.58 226.58
+ lwt:factor(smoke)           1   213.66 227.66
+ age:factor(smoke)           1   214.25 228.25
+ factor(smoke):factor(race)  2   212.29 228.29
+ age:lwt                     1   214.55 228.55
+ lwt:factor(race)            2   213.18 229.18
+ age:factor(race)            2   214.00 230.00

Call:
glm(formula = low ~ age + lwt + factor(smoke) + factor(race), 
    family = binomial(link = "logit"), data = bwt)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5173  -0.9065  -0.5865   1.3035   2.0401  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)   
(Intercept)        1.564123   1.167005   1.340  0.18015   
age               -0.022478   0.034170  -0.658  0.51065   
lwt               -0.012526   0.006386  -1.961  0.04982 * 
factor(smoke)1     1.054439   0.380000   2.775  0.00552 **
factor(race)other -0.288409   0.526756  -0.548  0.58402   
factor(race)white -1.231671   0.517152  -2.382  0.01724 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 214.58  on 183  degrees of freedom
AIC: 226.58

Number of Fisher Scoring iterations: 4

Notice that the model building summary contains no added interactions, as none of them reduced the AIC. The model then remains at only the main effects, as we told R to include these. From here we can use the other two techniques to select down further.

Diagnostics Plots

Linear regression models contain residuals with properties that are very useful for model diagnostics. However, what is a residual in a logistic regression model? Since we don’t have actual probabilities to compare our predicted probabilities against, residuals are not as clearly defined. Instead we have pseudo “residuals” in logistic regression that we can explore further. Two examples of these are deviance residuals and Pearson residuals.

Deviance is a measure of how far a fitted model is from the fully saturated model. The fully saturated model is a model that predicts our data perfectly by essentially overfitting to it - a parameter for each unique combination of inputs. This makes the saturated model impractical for use, but good for comparison. The deviance is essentially our “error” relative to this “perfect” model. Maximizing the logistic regression likelihood is equivalent to minimizing the deviance, which equals the sum of the squared deviance residuals. Deviance residuals tell us how much each observation contributes to the deviance.

Pearson residuals, on the other hand, tell us how much each observation contributes to the Pearson chi-squared statistic for the overall model.
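For reference, with $y_i \in \{0, 1\}$ and fitted probability $\hat{p}_i$, these two residuals take the standard forms:

$$r_i^{P}=\frac{y_i-\hat{p}_i}{\sqrt{\hat{p}_i(1-\hat{p}_i)}},\qquad r_i^{D}=\operatorname{sign}(y_i-\hat{p}_i)\sqrt{-2\big[y_i\log\hat{p}_i+(1-y_i)\log(1-\hat{p}_i)\big]}$$

so that the squared Pearson residuals sum to the Pearson chi-square statistic and the squared deviance residuals sum to the model deviance.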

Other measures of an observation's influence on the logistic regression model are DFBetas and Cook’s D. Similar to their interpretation in linear regression, these two calculations tell us how much each observation changes the estimation of each parameter individually (DFBetas) or all of the parameters holistically (Cook’s D).

Let’s see how to get all of these from our software packages!

SAS

SAS has some wonderful diagnostic plots to show us these residuals. In PROC LOGISTIC we ask for the influence and dfbetas plots through the plots = option.

ods html select ParameterEstimates InfluencePlots DfBetasPlot;
proc logistic data=logistic.lowbwt plots(only label)=(influence dfbetas);
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke;
    title 'Modeling Low Birth Weight';
run;
quit;
Modeling Low Birth Weight

Analysis of Maximum Likelihood Estimates

| Parameter  | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq |
|------------|----|----------|----------------|-----------------|------------|
| Intercept  | 1  | -0.1093  | 0.8821         | 0.0153          | 0.9014     |
| race black | 1  | 1.2900   | 0.5109         | 6.3766          | 0.0116     |
| race other | 1  | 0.9705   | 0.4122         | 5.5423          | 0.0186     |
| lwt        | 1  | -0.0133  | 0.00631        | 4.4149          | 0.0356     |
| smoke      | 1  | 1.0600   | 0.3783         | 7.8500          | 0.0051     |

Influence Plots: Panel 1


Influence Plots: Panel 2


DfBetas Plots: Panel 1


DfBetas Plots: Panel 2


The influence plots show a variety of plots. Among them are the Pearson residuals, Deviance residuals, and the difference in each of these for each observation. The main thing to look for in these plots are points that are far away from the rest of the observations.

The DFBetas plots show the standardized impact of each observation on the calculation of each of the parameters in the model. The main thing to look for in these plots are points that are far away from the rest of the observations.

These observations are not necessarily bad per se, but have a large influence on the model. These points might need to be investigated further to see if they are actually valid observations.

R

R has some wonderful diagnostic plots to show us these residuals. R also produces a list of these influence measures (and many more) with the influence.measures function. Below only the first 6 observations are shown using the head function, but the measures are calculated for every observation. The 4th plot from calling the plot function on the logistic regression model object is the Cook’s D plot. The dfbetasPlots function (from the car package) produces the DFBetas plots, but only one is shown here.
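A sketch of those diagnostics calls (again assuming bwt corresponds to MASS::birthwt; factor codings may differ slightly from the document's version):

```r
library(MASS)
bwt <- birthwt  # stand-in for the document's bwt data

fit <- glm(low ~ lwt + factor(smoke) + factor(race),
           data = bwt, family = binomial(link = "logit"))

# Matrix of per-observation DFBetas, DFFITS, Cook's D, and leverage
head(influence.measures(fit)$infmat)

plot(fit, which = 4)      # Cook's distance plot
# car::dfbetasPlots(fit)  # DFBetas panels (requires the car package)
```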


Call:
glm(formula = low ~ lwt + factor(smoke) + factor(race), family = binomial(link = "logit"), 
    data = bwt)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5278  -0.9053  -0.5863   1.2878   2.0364  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)        1.18087    1.00983   1.169  0.24225   
lwt               -0.01326    0.00631  -2.101  0.03562 * 
factor(smoke)1     1.06001    0.37832   2.802  0.00508 **
factor(race)other -0.31958    0.52560  -0.608  0.54317   
factor(race)white -1.29009    0.51087  -2.525  0.01156 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 215.01  on 184  degrees of freedom
AIC: 225.01

Number of Fisher Scoring iterations: 4
        dfb.1_     dfb.lwt    dfb.f()1 dfb.fctr(rc)t dfb.fctr(rc)w      dffit
1  0.005482474 -0.07059196  0.04099957   0.084169785   0.077118444 -0.1478710
2  0.056206296 -0.06625296  0.01670316  -0.053637750  -0.017316670 -0.1025779
3 -0.041579791  0.05546389 -0.04949097   0.007793374  -0.033786700 -0.1376476
4 -0.033066392  0.04576740 -0.04891015   0.005121733  -0.034739354 -0.1315161
5 -0.035866669  0.04895752 -0.04910518   0.006000057  -0.034429278 -0.1334895
6  0.006926211 -0.01229911  0.02585701  -0.048071657  -0.008021665 -0.1026993
     cov.r      cook.d        hat
1 1.063374 0.002920437 0.04561105
2 1.037861 0.001411803 0.02226879
3 1.025510 0.002869358 0.02136575
4 1.025050 0.002598127 0.02018000
5 1.025185 0.002683911 0.02055592
6 1.025764 0.001499369 0.01588439

The DFBetas plots show the standardized impact of each observation on the calculation of each of the parameters in the model. The main thing to look for in these plots are points that are far away from the rest of the observations.

These observations are not necessarily bad per se, but have a large influence on the model. These points might need to be investigated further to see if they are actually valid observations.

Calibration Curves

Another evaluation/diagnostic for logistic regression is the calibration curve. The calibration curve is a goodness-of-fit measure for logistic regression. Calibration measures how well predicted probabilities agree with actual frequency counts of outcomes (estimated linearly across the data set). These curves can help detect bias in your model - if predictions are consistently too high or low.

If the curve is above the diagonal line, this indicates the model is predicting lower probabilities than actually observed. The opposite is true if the curve is below the diagonal line.

This is best used on larger samples since we are calculating the observed proportion of events in the data. In smaller samples this relationship is extrapolated out from the center and may not reflect the truth as accurately.

SAS

SAS does not produce a calibration curve by default, but we can easily create one ourselves. After building the model with PROC LOGISTIC and outputting the predicted probabilities from our training data using the OUTPUT statement, we then sort these predicted probabilities using PROC SORT. We do this to easily plot the predicted probabilities in the SGPLOT procedure.

In PROC SGPLOT we use our predicted probabilities data set - here called cali. We then use the LOESS statement to fit a LOESS regression to our target variable (low) using our predicted probabilities (PredProb). We allow for a cubic interpolation on this LOESS regression. The clm option produces a confidence interval around this curve. We then plot the diagonal line through the origin with a slope of 1. This produces our calibration curve.

proc logistic data=logistic.lowbwt noprint;
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke;
    output out=cali predicted=PredProb;
run;

proc sort data=cali;  
    by PredProb; 
run;

proc sgplot data=cali noautolegend aspect=1;
   loess x=PredProb y=low / interpolation=cubic clm;
   lineparm x=0 y=0 slope=1 / lineattrs=(color=grey pattern=dash);
   title 'Calibration Curve for Birthweight Data';
run;

The SGPlot Procedure


Since the diagonal line is contained in the confidence interval for our calibration curve, we do not notice any significant number of over or under predictions even though the curve does take a slight downwards turn.

R

R produces a calibration curve using the givitiCalibrationBelt function. The inputs to this function are o = and e = - the observed target and expected target respectively. We place our actual target variable in the o = option and the predictions from our logistic regression model in the e = option. Since the model is being compared to its own training data, the devel = "internal" option is specified. The maxDeg = option sets the maximum polynomial degree being tested for the curve.
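A hedged sketch of the corresponding call is below. The givitiR package is assumed, and fit.logit plus the simulated stand-in data are hypothetical placeholders for the birthweight model and data, not the document's exact objects:

```r
# Hedged sketch, assuming the givitiR package is installed.
library(givitiR)

# Stand-in data and model (replace with the birthweight data and model).
set.seed(123)
bwt <- data.frame(lwt = rnorm(200, 130, 30))
bwt$low <- rbinom(200, 1, plogis(2 - 0.02 * bwt$lwt))
fit.logit <- glm(low ~ lwt, data = bwt, family = binomial)

p_hat <- predict(fit.logit, type = "response")   # predicted probabilities

cal <- givitiCalibrationBelt(o = bwt$low,        # observed 0/1 target
                             e = p_hat,          # expected (predicted) probabilities
                             devel = "internal", # compared to its own training data
                             maxDeg = 4)         # max polynomial degree tested
cal$m        # polynomial degree chosen for the curve
cal$p.value  # test of deviation from perfect calibration
plot(cal)    # calibration belt plot
```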


Call:
glm(formula = low ~ lwt + factor(smoke) + factor(race), family = binomial(link = "logit"), 
    data = bwt)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5278  -0.9053  -0.5863   1.2878   2.0364  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)        1.18087    1.00983   1.169  0.24225   
lwt               -0.01326    0.00631  -2.101  0.03562 * 
factor(smoke)1     1.06001    0.37832   2.802  0.00508 **
factor(race)other -0.31958    0.52560  -0.608  0.54317   
factor(race)white -1.29009    0.51087  -2.525  0.01156 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 215.01  on 184  degrees of freedom
AIC: 225.01

Number of Fisher Scoring iterations: 4

$m
[1] 2

$p.value
[1] 0.2255517

Since the diagonal line is contained in the confidence interval for our calibration curve, we do not notice any significant number of over or under predictions even though the curve does take a slight downwards turn.

Model Assessment

One of the common concerns and questions in any model development is determining how “good” a model is or how well it performs. There are many different factors that determine this and most depend on the goals for the model. There are typically two different purposes for modeling - estimation and prediction. Estimation quantifies the expected change in our target variable associated with some relationship or change in the predictor variables. Prediction on the other hand is focused on predicting new target observations. However, these goals are rarely seen in isolation as most people desire a blend of these goals for their models. This section will cover many of the popular metrics for model assessment.

We will be using the birthweight data set to model the association between various factors and a child being born with low birth weight (<2.5kg). The variables in the data set are the following:

Birth Weight Data Dictionary
Variable Description
age mother’s age (years)
lwt mother’s weight at last menstrual period (lbs)
smoke mother’s smoking status during pregnancy
race mother’s race (1 = White, 2 = Black, 3 = Other)
ptl number of premature labors
ht history of hypertension
ui uterine irritability
ftv number of physician visits during first trimester

Comparing Models

The first thing to remember about model assessment is that a model is only “good” in context with another model. All of these model metrics are truly model comparisons. Is an accuracy of 80% good? It depends! If the previous model used has an accuracy of 90%, then no, the new model is not good. However, if the previous model used has an accuracy of 70%, then yes, the model is good. Although we will show many of the calculations, at no place will we say that you must meet a certain threshold for your models to be considered “good” because these metrics are designed for comparison.

Some common model metrics are based on deviance/likelihood calculations. Three common logistic regression metrics based on these are the following:

  1. AIC
  2. BIC
  3. Generalized (Nagelkerke) \(R^2\)

Without going into too much mathematical detail, the AIC is a crude, large-sample approximation of leave-one-out cross validation. The BIC, on the other hand, favors smaller models than the AIC as it penalizes model complexity more. In both AIC and BIC, lower values are better. However, there is no threshold for how much lower a value must be to be meaningfully better. Neither the AIC nor the BIC is necessarily better than the other; however, they may not always agree on the “best” model.

There are a number of “pseudo”-\(R^2\) metrics for logistic regression. Here, higher values indicate a better model. The Generalized (Nagelkerke) \(R^2\) measures how much better a model is (in terms of likelihood) compared to the intercept-only model. Therefore, we compare two models with this metric to see which one improves on the intercept-only model by more. Essentially, both are compared to the same baseline, so we ask which beats that baseline by more. Even though it is bounded between 0 and 1, this metric does not carry the interpretation we had in linear regression.
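Concretely, writing \(L_0\) for the likelihood of the intercept-only model, \(L_M\) for the likelihood of the fitted model, and \(n\) for the sample size, the standard construction is:

\[ R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n} \]

\[ R^2_{Nagelkerke} = \frac{R^2_{CS}}{1 - L_0^{2/n}} \]

The division by \(1 - L_0^{2/n}\) is the max-rescaling that guarantees an upper bound of 1.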

Let’s see how we get these metrics in each of our softwares!

SAS

PROC LOGISTIC gives a lot of these metrics by default in the model fit statistics table provided in the output. However, to get the Generalized \(R^2\) we must use the rsq option in the MODEL statement.

ods html select ParameterEstimates FitStatistics RSquare;
proc logistic data=logistic.lowbwt plots(only)=(oddsratio);
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke / rsq clodds=pl clparm=pl;
    title 'Modeling Low Birth Weight';
run;
quit;
Modeling Low Birth Weight

Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC 236.672 225.015
SC 239.914 241.223
-2 Log L 234.672 215.015

R-Square 0.0988 Max-rescaled R-Square 0.1389

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept 1 -0.1093 0.8821 0.0153 0.9014
race black 1 1.2900 0.5109 6.3766 0.0116
race other 1 0.9705 0.4122 5.5423 0.0186
lwt 1 -0.0133 0.00631 4.4149 0.0356
smoke 1 1.0600 0.3783 7.8500 0.0051

We can see from the table above what the Generalized \(R^2\) is for this model. The max-rescaled \(R^2\) value ensures that the bounds are 0 and 1, so this is the metric we use to compare different models.

R

R provides a lot of these metrics by default in the output of the summary function on our logistic regression objects. However, we can call them separately through the AIC, BIC, and PseudoR2 functions as well. Note the which = Nagelkerke option in the PseudoR2 function to get the correct \(R^2\) value.
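A sketch of these calls is below; fit.logit and the simulated data are hypothetical stand-ins for the document's model, and the DescTools package is assumed for the PseudoR2 function:

```r
# Sketch, assuming the DescTools package for PseudoR2.
library(DescTools)

# Stand-in data and model (replace with the birthweight model).
set.seed(123)
dat <- data.frame(x = rnorm(200))
dat$y <- rbinom(200, 1, plogis(-0.5 + dat$x))
fit.logit <- glm(y ~ x, data = dat, family = binomial)

AIC(fit.logit)                             # lower is better
BIC(fit.logit)                             # penalizes complexity more than AIC
PseudoR2(fit.logit, which = "Nagelkerke")  # max-rescaled generalized R^2
```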

[1] 225.0147
[1] 241.2234
Nagelkerke 
 0.1389144 

We can see from the table above what the Generalized \(R^2\) is for this model. Unlike SAS, this value has already been max-rescaled to be between 0 and 1.

Probability Metrics

Logistic regression is a model for predicting the probability of an event, not the occurrence of an event. Logistic regression can be used for classification as well. Good models should reflect both good metrics on probability and classification, but the importance of one over the other depends on the problem.

In this section we will focus on the probability metrics. We want our model to assign higher predicted probabilities to events and lower predicted probabilities to non-events.

Coefficient of Discrimination

The coefficient of discrimination (or discrimination slope) is the difference in average predicted probability between the actual events and non-events.

\[ D = \bar{\hat{p}}_1 - \bar{\hat{p}}_0 \]

The bigger this difference, the better our model does at separating the events and non-events through the predicted probabilities. We can also examine the histograms of these predictions for comparison.

Let’s see how to calculate this in each of our softwares!

SAS

PROC LOGISTIC doesn’t provide the coefficient of discrimination through an option. However, we can easily calculate it ourselves from the predicted probabilities. Using the OUTPUT statement, we gather the predicted probabilities for our training set as we have previously seen. PROC SORT is then used to sort these predicted probabilities for graphing purposes. Next, we use PROC TTEST. We aren’t actually interested in the t-test itself, but rather the calculation of the mean difference between the two categories and the visual representation of the two histograms - the predicted probabilities of the events and non-events. In PROC TTEST we use the predprobs data set from PROC LOGISTIC. The CLASS statement defines the two groups we are trying to compare, which is our target variable. The VAR statement defines the variable we are averaging and comparing, which is our predicted probabilities phat.

proc logistic data=logistic.lowbwt noprint;
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke / clodds=pl clparm=pl;
    output out=predprobs p=phat;
run;
 
proc sort data=predprobs;
    by descending low;
run;

proc ttest data=predprobs order=data;
    ods select statistics summarypanel;
    class low;
    var phat;
    title 'Coefficient of Discrimination and Plots';
run;
Coefficient of Discrimination and Plots

Variable: phat (Estimated Probability)

low Method N Mean Std Dev Std Err Minimum Maximum
1 59 0.3776 0.1401 0.0182 0.1258 0.7028
0 130 0.2825 0.1407 0.0123 0.0580 0.6887
Diff (1-2) Pooled 0.0952 0.1405 0.0221
Diff (1-2) Satterthwaite 0.0952 0.0220

Summary Panel for phat


The coefficient of discrimination is the difference of the averages, which is reported on the bottom two lines of the table above. The standard error calculation differs between the two lines, but our only focus is the actual mean difference of 0.095. We can also see the two histograms of our predictions. These histograms seem to have a lot of overlap, leading us to believe that our model gives many events and non-events similar predicted probabilities.

R

It is easy to calculate the difference of the averages of two vectors in R, which is exactly what we are doing. First, we use the predict function to gather the predictions from our model (type = "response" to get probabilities) and put them as a new column p_hat in our data frame. From there we just isolate out the events and non-events and save them in two vectors p1 and p0. This is done by conditioning the data frame to only have rows where either low == 1 or low == 0. Lastly, the coefficient of discrimination is just the difference of the means of these two vectors. A plot of the two histograms overlaid on each other is also provided for visual inspection.
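A hedged sketch of those steps, using a simulated stand-in for the birthweight data and a hypothetical model name fit.logit:

```r
# Stand-in data and model (replace with the birthweight data and model).
set.seed(123)
bwt <- data.frame(lwt = rnorm(200, 130, 30))
bwt$low <- rbinom(200, 1, plogis(2 - 0.02 * bwt$lwt))
fit.logit <- glm(low ~ lwt, data = bwt, family = binomial)

bwt$p_hat <- predict(fit.logit, type = "response")

p1 <- bwt$p_hat[bwt$low == 1]   # predictions for actual events
p0 <- bwt$p_hat[bwt$low == 0]   # predictions for actual non-events

D <- mean(p1) - mean(p0)        # coefficient of discrimination
D

# Overlaid histograms of the two sets of predictions.
hist(p0, col = rgb(0, 0, 1, 0.4), xlim = c(0, 1),
     main = "Predicted Probabilities", xlab = "p_hat")
hist(p1, col = rgb(1, 0, 0, 0.4), add = TRUE)
```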

The coefficient of discrimination is the difference of the two means, which is again about 0.095. The overlaid histograms seem to have a lot of overlap, leading us to believe that our model gives many events and non-events similar predicted probabilities.

Rank-order Statistics

Rank-order statistics measure how well a model orders the predicted probabilities. Three common metrics that summarize this are concordance, discordance, and ties. In these metrics every single combination of an event and non-event is compared (1 event vs. 1 non-event). A concordant pair is a pair where the event has a higher predicted probability than the non-event - the model got the rank correct. A discordant pair is a pair where the event has a lower predicted probability than the non-event - the model got the rank wrong. A tied pair is a pair where the event and non-event have the same predicted probability - the model isn’t sure how to distinguish between them. Models with higher concordance are considered better. The interpretation of concordance is that, for all possible event and non-event combinations, the model assigned the higher predicted probability to the observation with the event concordance% of the time.

There are a host of other metrics that are based on these rank-order statistics, such as the \(c\)-statistic, Somers’ D, and Kendall’s \(\tau_a\). The calculations for these are as follows:

\[ c = Concordance + 1/2\times Tied \]

\[ D_{xy} = 2c - 1 \]

\[ \tau_a = \frac{Concordant - Discordant}{0.5 \times n \times (n-1)} \]

With all of these, higher values of concordant pairs result in higher values of these metrics.
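To make the definitions concrete, here is a hedged, brute-force sketch (a hypothetical helper, not the document's code) that enumerates every event/non-event pair; it is only practical for small data:

```r
# Brute-force rank-order statistics for a 0/1 target y and
# predicted probabilities p_hat (enumerates all event/non-event pairs).
rank_order_stats <- function(y, p_hat) {
  pairs <- expand.grid(event = p_hat[y == 1], nonevent = p_hat[y == 0])
  conc <- mean(pairs$event > pairs$nonevent)    # concordant proportion
  disc <- mean(pairs$event < pairs$nonevent)    # discordant proportion
  tied <- mean(pairs$event == pairs$nonevent)   # tied proportion
  c_stat <- conc + 0.5 * tied                   # c-statistic
  n <- length(y)
  c(Concordance = conc, Discordance = disc, Tied = tied,
    c = c_stat, SomersD = 2 * c_stat - 1,
    TauA = (conc - disc) * nrow(pairs) / (0.5 * n * (n - 1)))
}

# Toy example: 3 of the 4 event/non-event pairs are concordant,
# so c = 0.75 and Somers' D = 0.5.
rank_order_stats(y = c(1, 1, 0, 0), p_hat = c(0.9, 0.4, 0.5, 0.1))
```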

Let’s see how to calculate these in each of our softwares!

SAS

SAS provides all these metrics by default in the associations table part of the PROC LOGISTIC output.

ods html select Association ParameterEstimates;
proc logistic data=logistic.lowbwt plots(only)=(oddsratio);
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke / clodds=pl clparm=pl;
    title 'Modeling Low Birth Weight';
run;
quit;
Modeling Low Birth Weight

Analysis of Maximum Likelihood Estimates
Parameter    DF    Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept 1 -0.1093 0.8821 0.0153 0.9014
race black 1 1.2900 0.5109 6.3766 0.0116
race other 1 0.9705 0.4122 5.5423 0.0186
lwt 1 -0.0133 0.00631 4.4149 0.0356
smoke 1 1.0600 0.3783 7.8500 0.0051

Association of Predicted Probabilities and Observed Responses
Percent Concordant 68.3 Somers' D 0.371
Percent Discordant 31.2 Gamma 0.373
Percent Tied 0.5 Tau-a 0.160
Pairs 7670 c 0.686

As we can see from the output, our model assigned the higher predicted probability to the observation with the low birth weight 68.3% of the time.

R

Although not provided immediately in the model summary, R can easily provide the metrics above with the Concordance and somersD functions.

$Concordance
[1] 0.6831812

$Discordance
[1] 0.3168188

$Tied
[1] 5.551115e-17

$Pairs
[1] 7670
[1] 0.3663625

As we can see from the output, our model assigned the higher predicted probability to the observation with the low birth weight 68.3% of the time.

Classification Metrics

Logistic regression is a model for predicting the probability of an event, not the occurrence of an event. Logistic regression can be used for classification as well. Good models should reflect both good metrics on probability and classification, but the importance of one over the other depends on the problem.

In this section we will focus on the classification metrics. We want a model to correctly classify events and non-events. Classification forces the model to predict either an event or non-event for each observation based on the predicted probability for that observation. For example, \(\hat{y}_i = 1\) if \(\hat{p}_i > 0.5\). These decision points are called cut-offs or thresholds. However, strict classification-based measures completely discard any information about the actual quality of the model’s predicted probabilities.

Many of the metrics around classification try to balance different pieces of the classification table (also called the confusion matrix). An example of one is shown below.

Classification Table


Let’s examine the different pieces of the classification table that people focus on together.

Sensitivity & Specificity

Sensitivity is the proportion of times you were able to predict an event in the group of actual events. Of the actual events, the proportion of the time you correctly predicted an event. This is also called the true positive rate. This is also just another name for recall.

Sensitivity


This is typically balanced with the specificity. Specificity is the proportion of times you were able to predict a non-event in the group of actual non-events. Of the actual non-events, the proportion of the time you correctly predicted a non-event. This is also called the true negative rate.

Specificity


These offset each other in a model. One could easily maximize one of these at the cost of the other. To get maximum sensitivity you can just predict all events, however this would drop your specificity to 0. The reverse is also true. Those who focus on sensitivity and specificity want balance in each. One measure for the optimal cut-off from a model is the Youden’s Index (or Youden’s J Statistic). This is easily calculated as \(J = sensitivity + specificity - 1\). The optimal cut-off for determining predicted events and non-events would be at the point where this is maximized.
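As a concrete illustration, a minimal base-R sketch (a hypothetical helper, not the document's code) that computes Youden's J at one cut-off:

```r
# Youden's J at a single cut-off for a 0/1 target y and
# predicted probabilities p_hat.
youden_j <- function(y, p_hat, cutoff) {
  pred <- as.numeric(p_hat > cutoff)               # classify at the cut-off
  sens <- sum(pred == 1 & y == 1) / sum(y == 1)    # true positive rate
  spec <- sum(pred == 0 & y == 0) / sum(y == 0)    # true negative rate
  sens + spec - 1                                  # Youden's J
}

# Toy example: perfect separation at this cut-off gives J = 1.
youden_j(y = c(1, 1, 0, 0), p_hat = c(0.9, 0.6, 0.4, 0.1), cutoff = 0.5)
```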

Let’s see how to do this in each of our softwares!

SAS

PROC LOGISTIC produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. The ctable option in the MODEL statement creates the classification table for us. The pprob option defines the cut-off (or range of cut-offs) for this classification table. We want to look at all classification tables for cut-offs ranging from a predicted probability of 0 to 0.98 by steps of 0.02. Next, we use a DATA STEP to take the classification table from PROC LOGISTIC and calculate the Youden J statistic for each of the cut-offs from the given sensitivity and specificity provided by SAS. From there we use PROC SORT to rank these by this youden variable and print the first 5 observations with PROC PRINT.

ods html select Classification;
proc logistic data=logistic.lowbwt plots(only)=(oddsratio);
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke / ctable pprob = 0 to 0.98 by 0.02;
    ods output classification=classtable;
    title 'Modeling Low Birth Weight';
run;
quit;

data classtable;
    set classtable;
    youden = sensitivity + specificity - 100;
    drop PPV NPV Correct;
run;

proc sort data=classtable;
    by descending youden;
run;

proc print data=classtable (obs = 5);
run;
Modeling Low Birth Weight

Classification Table
Prob Level   Correct Event   Correct Nonevent   Incorrect Event   Incorrect Nonevent   % Correct   Sensitivity   Specificity   Pos Pred   Neg Pred
0.000 59 0 130 0 31.2 100.0 0.0 31.2 .
0.020 59 0 130 0 31.2 100.0 0.0 31.2 .
0.040 59 0 130 0 31.2 100.0 0.0 31.2 .
0.060 59 1 129 0 31.7 100.0 0.8 31.4 100.0
0.080 59 4 126 0 33.3 100.0 3.1 31.9 100.0
0.100 59 10 120 0 36.5 100.0 7.7 33.0 100.0
0.120 58 15 115 1 38.6 98.3 11.5 33.5 93.8
0.140 57 26 104 2 43.9 96.6 20.0 35.4 92.9
0.160 54 35 95 5 47.1 91.5 26.9 36.2 87.5
0.180 53 43 87 6 50.8 89.8 33.1 37.9 87.8
0.200 53 45 85 6 51.9 89.8 34.6 38.4 88.2
0.220 51 47 83 8 51.9 86.4 36.2 38.1 85.5
0.240 48 51 79 11 52.4 81.4 39.2 37.8 82.3
0.260 47 53 77 12 52.9 79.7 40.8 37.9 81.5
0.280 45 56 74 14 53.4 76.3 43.1 37.8 80.0
0.300 43 62 68 16 55.6 72.9 47.7 38.7 79.5
0.320 38 69 61 21 56.6 64.4 53.1 38.4 76.7
0.340 33 82 48 26 60.8 55.9 63.1 40.7 75.9
0.360 26 90 40 33 61.4 44.1 69.2 39.4 73.2
0.380 19 98 32 40 61.9 32.2 75.4 37.3 71.0
0.400 14 105 25 45 63.0 23.7 80.8 35.9 70.0
0.420 9 111 19 50 63.5 15.3 85.4 32.1 68.9
0.440 9 113 17 50 64.6 15.3 86.9 34.6 69.3
0.460 9 119 11 50 67.7 15.3 91.5 45.0 70.4
0.480 9 119 11 50 67.7 15.3 91.5 45.0 70.4
0.500 9 120 10 50 68.3 15.3 92.3 47.4 70.6
0.520 9 122 8 50 69.3 15.3 93.8 52.9 70.9
0.540 8 123 7 51 69.3 13.6 94.6 53.3 70.7
0.560 8 123 7 51 69.3 13.6 94.6 53.3 70.7
0.580 7 124 6 52 69.3 11.9 95.4 53.8 70.5
0.600 7 124 6 52 69.3 11.9 95.4 53.8 70.5
0.620 7 125 5 52 69.8 11.9 96.2 58.3 70.6
0.640 4 127 3 55 69.3 6.8 97.7 57.1 69.8
0.660 3 127 3 56 68.8 5.1 97.7 50.0 69.4
0.680 2 127 3 57 68.3 3.4 97.7 40.0 69.0
0.700 0 129 1 59 68.3 0.0 99.2 0.0 68.6
0.720 0 130 0 59 68.8 0.0 100.0 . 68.8
0.740 0 130 0 59 68.8 0.0 100.0 . 68.8
0.760 0 130 0 59 68.8 0.0 100.0 . 68.8
0.780 0 130 0 59 68.8 0.0 100.0 . 68.8
0.800 0 130 0 59 68.8 0.0 100.0 . 68.8
0.820 0 130 0 59 68.8 0.0 100.0 . 68.8
0.840 0 130 0 59 68.8 0.0 100.0 . 68.8
0.860 0 130 0 59 68.8 0.0 100.0 . 68.8
0.880 0 130 0 59 68.8 0.0 100.0 . 68.8
0.900 0 130 0 59 68.8 0.0 100.0 . 68.8
0.920 0 130 0 59 68.8 0.0 100.0 . 68.8
0.940 0 130 0 59 68.8 0.0 100.0 . 68.8
0.960 0 130 0 59 68.8 0.0 100.0 . 68.8
0.980 0 130 0 59 68.8 0.0 100.0 . 68.8



Modeling Low Birth Weight

Obs ProbLevel CorrectEvents CorrectNonevents IncorrectEvents IncorrectNonevents Sensitivity Specificity FalsePositive FalseNegative youden
1 0.200 53 45 85 6 89.8 34.6 61.6 11.8 24.4459
2 0.180 53 43 87 6 89.8 33.1 62.1 12.2 22.9074
3 0.220 51 47 83 8 86.4 36.2 61.9 14.5 22.5945
4 0.240 48 51 79 11 81.4 39.2 62.2 17.7 20.5867
5 0.300 43 62 68 16 72.9 47.7 61.3 20.5 20.5737

We can see that the highest Youden J statistic had a value of 24.4459. This took place at a cut-off of 0.2. Therefore, according to the Youden Index at least, the optimal cut-off for our model is 0.2. In other words, if our model predicts a probability above 0.2 then we should call this an event. Any predicted probability below 0.2 should be called a non-event.

Another commonly used visual for the balance of sensitivity and specificity across all of the cut-offs is the Receiver Operating Characteristic curve, commonly known as the ROC curve. The ROC curve plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate) across every cut-off. The “best” ROC curve is one that reaches to the upper left-hand corner of the plot, as that would imply our model has both high sensitivity and high specificity. The worst ROC curve is represented by the diagonal line in the plot, since that would imply our model is only as good as randomly assigning events and non-events to our observations. This leads some to calculate the area under the ROC curve (typically called the AUC) as a metric summarizing the curve itself. The math won’t be shown here, but the AUC is equal to the \(c\)-statistic from the Rank-order Statistics section. Isn’t math fun!?!?

SAS easily produces ROC curves from PROC LOGISTIC using the plots=ROC option.

ods html select ROCCurve;
proc logistic data=logistic.lowbwt plots(only)=ROC;
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke / clodds=pl clparm=pl;
    title 'Modeling Low Birth Weight';
run;
quit;
Modeling Low Birth Weight

ROC Curve for Model


SAS can also compare the ROC curves across many models statistically to see if they are different based on their AUC. This is done through ROC statements. The only downside is that SAS can only compare nested versions of the model in the MODEL statement. Notice how the two models in the ROC statements are just nested versions of the model in the MODEL statement. The ROCCONTRAST statement is then used with the estimate = allpairs option to statistically compare these models.

ods html select ROCContrastEstimate ROCOverlay ROCContrastTest;
proc logistic data=logistic.lowbwt plots(only)=ROC;
    class race(ref='white') / param=ref; 
    model low(event='1') = age race lwt smoke / clodds=pl clparm=pl;
    ROC 'Omit Age' race lwt smoke;
    ROC 'Omit Race' lwt smoke;
    ROCcontrast / estimate = allpairs;
    title 'Comparing ROC Curves';
run;
quit;
Comparing ROC Curves

ROC Curves for Comparisons


ROC Contrast Test Results
Contrast DF Chi-Square Pr > ChiSq
Reference = Model 2 1.7008 0.4273

ROC Contrast Estimation and Testing Results by Row
Contrast   Estimate   Standard Error   95% Wald Confidence Limits   Chi-Square   Pr > ChiSq
Model - Omit Age -0.00183 0.00940 -0.0202 0.0166 0.0377 0.8460
Model - Omit Race 0.0432 0.0344 -0.0242 0.1107 1.5772 0.2092
Omit Age - Omit Race 0.0450 0.0345 -0.0227 0.1127 1.7008 0.1922

Although the ROC curves in the plot above look different from each other visually, statistically we cannot find any differences between them based on the p-values from the tests in the table above.

R

R produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. The confusionMatrix function creates the classification table for us. The threshold option defines the cut-off for this classification table.

    0  1
0 122 50
1   8  9

We want to look at all classification tables for cut-offs ranging from a predicted probability of 0 to 0.98 by steps of 0.02. We can easily loop through this calculation with a for loop in R. In this loop we calculate the sensitivity and specificity at each threshold using the sensitivity and specificity functions. We also calculate the Youden J statistic for each of the cut-offs from the sensitivity and specificity provided by R. From there we combine these variables into a single data frame and print the observations with the print function.
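A hedged base-R sketch of such a loop is below; it uses a direct threshold comparison rather than any particular package's sensitivity/specificity functions, and y and p_hat are simulated stand-ins for the target and model predictions:

```r
# Stand-in target and predictions (replace with the birthweight model's).
set.seed(123)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + x))
p_hat <- predict(glm(y ~ x, family = binomial), type = "response")

# Sensitivity, specificity, and Youden's J at each cut-off.
cutoffs <- seq(0.02, 0.98, by = 0.02)
sens <- spec <- numeric(length(cutoffs))

for (i in seq_along(cutoffs)) {
  pred <- as.numeric(p_hat > cutoffs[i])
  sens[i] <- sum(pred == 1 & y == 1) / sum(y == 1)
  spec[i] <- sum(pred == 0 & y == 0) / sum(y == 0)
}

results <- data.frame(cutoff = cutoffs, sens = sens, spec = spec,
                      youden = sens + spec - 1)
results[which.max(results$youden), ]   # cut-off maximizing Youden's J
```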

   cutoff       sens        spec      youden
1    0.02 1.00000000 0.000000000 0.000000000
2    0.04 1.00000000 0.000000000 0.000000000
3    0.06 1.00000000 0.007692308 0.007692308
4    0.08 1.00000000 0.030769231 0.030769231
5    0.10 1.00000000 0.084615385 0.084615385
6    0.12 1.00000000 0.123076923 0.123076923
7    0.14 0.96610169 0.230769231 0.196870926
8    0.16 0.96610169 0.284615385 0.250717080
9    0.18 0.93220339 0.330769231 0.262972621
10   0.20 0.89830508 0.353846154 0.252151239
11   0.22 0.89830508 0.369230769 0.267535854
12   0.24 0.86440678 0.400000000 0.264406780
13   0.26 0.81355932 0.415384615 0.228943937
14   0.28 0.79661017 0.430769231 0.227379400
15   0.30 0.74576271 0.484615385 0.230378096
16   0.32 0.67796610 0.553846154 0.231812256
17   0.34 0.61016949 0.661538462 0.271707953
18   0.36 0.54237288 0.723076923 0.265449804
19   0.38 0.38983051 0.784615385 0.174445893
20   0.40 0.28813559 0.815384615 0.103520209
21   0.42 0.27118644 0.853846154 0.125032595
22   0.44 0.18644068 0.915384615 0.101825293
23   0.46 0.15254237 0.923076923 0.075619296
24   0.48 0.15254237 0.930769231 0.083311604
25   0.50 0.15254237 0.938461538 0.091003911
26   0.52 0.15254237 0.946153846 0.098696219
27   0.54 0.15254237 0.953846154 0.106388527
28   0.56 0.13559322 0.953846154 0.089439374
29   0.58 0.13559322 0.953846154 0.089439374
30   0.60 0.11864407 0.969230769 0.087874837
31   0.62 0.11864407 0.976923077 0.095567145
32   0.64 0.11864407 0.976923077 0.095567145
33   0.66 0.06779661 0.992307692 0.060104302
34   0.68 0.05084746 0.992307692 0.043155150
35   0.70 0.03389831 1.000000000 0.033898305
36   0.72 0.00000000 1.000000000 0.000000000
37   0.74 0.00000000 1.000000000 0.000000000
38   0.76 0.00000000 1.000000000 0.000000000
39   0.78 0.00000000 1.000000000 0.000000000
40   0.80 0.00000000 1.000000000 0.000000000
41   0.82 0.00000000 1.000000000 0.000000000
42   0.84 0.00000000 1.000000000 0.000000000
43   0.86 0.00000000 1.000000000 0.000000000
44   0.88 0.00000000 1.000000000 0.000000000
45   0.90 0.00000000 1.000000000 0.000000000
46   0.92 0.00000000 1.000000000 0.000000000
47   0.94 0.00000000 1.000000000 0.000000000
48   0.96 0.00000000 1.000000000 0.000000000
49   0.98 0.00000000 1.000000000 0.000000000

We can see that the highest Youden J statistic had a value of 0.2675. This took place at a cut-off of 0.22. Therefore, according to the Youden Index at least, the optimal cut-off for our model is 0.22. In other words, if our model predicts a probability above 0.22 then we should call this an event. Any predicted probability below 0.22 should be called a non-event.

Another commonly used visual for the balance of sensitivity and specificity across all of the cut-offs is the Receiver Operating Characteristic curve, commonly known as the ROC curve. The ROC curve plots sensitivity (the true positive rate) against 1 - specificity (the false positive rate) across every cut-off. The “best” ROC curve is one that reaches to the upper left-hand corner of the plot, as that would imply our model has both high sensitivity and high specificity. The worst ROC curve is represented by the diagonal line in the plot, since that would imply our model is only as good as randomly assigning events and non-events to our observations. This leads some to calculate the area under the ROC curve (typically called the AUC) as a metric summarizing the curve itself. The math won’t be shown here, but the AUC is equal to the \(c\)-statistic from the Rank-order Statistics section. Isn’t math fun!?!?

R easily produces ROC curves from a variety of functions. The first combination below is the plotROC and AUROC functions that produce the ROC curve and AUC respectively.

[1] 0.6823338

Another common function is the performance function that produces many more plots than the ROC curve. Here the ROC curve is obtained by plotting the true positive rate against the false positive rate using the measure = "tpr" and x.measure = "fpr" options. The AUC is also obtained from the performance function by calling the measure = "auc" option.
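A sketch of these calls is below, assuming the ROCR package; y and p_hat are simulated stand-ins for the target and the model's predictions:

```r
# Sketch, assuming the ROCR package.
library(ROCR)

# Stand-in target and predictions (replace with the birthweight model's).
set.seed(123)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + x))
p_hat <- predict(glm(y ~ x, family = binomial), type = "response")

pred.obj <- prediction(predictions = p_hat, labels = y)

# ROC curve: true positive rate vs. false positive rate.
perf.roc <- performance(pred.obj, measure = "tpr", x.measure = "fpr")
plot(perf.roc)
abline(a = 0, b = 1, lty = 2)   # the "no model" diagonal

# Area under the ROC curve.
auc <- performance(pred.obj, measure = "auc")@y.values[[1]]
auc
```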

[[1]]
[1] 0.685528

Precision & Recall

Recall is the proportion of times you were able to predict an event in the group of actual events. Of the actual events, the proportion of the time you correctly predicted an event. This is also called the true positive rate. This is also just another name for sensitivity.

Recall


This is balanced here with the precision. Precision is the proportion of predicted events that were actually events. Of the predicted events, the proportion of the time they actually were events. This is also called the positive predictive value. Precision is growing in popularity as a balance to recall/sensitivity, as compared to specificity.

Precision


These offset each other in a model. One could easily maximize one of these at the cost of the other. To get maximum recall you can just predict all events; however, this would drop your precision. Those who focus on precision and recall want balance in each. One measure for the optimal cut-off from a model is the F1 Score. This is easily calculated as the following:

\[ F_1 = 2\times \frac{precision \times recall}{precision + recall} \]

The optimal cut-off for determining predicted events and non-events would be at the point where this is maximized.
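As a concrete illustration, a minimal base-R sketch (a hypothetical helper, not the document's code) that computes the F1 score at one cut-off:

```r
# F1 score at a single cut-off for a 0/1 target y and
# predicted probabilities p_hat.
f1_score <- function(y, p_hat, cutoff) {
  pred <- as.numeric(p_hat > cutoff)
  tp <- sum(pred == 1 & y == 1)      # true positives
  precision <- tp / sum(pred == 1)   # positive predictive value
  recall <- tp / sum(y == 1)         # sensitivity
  2 * precision * recall / (precision + recall)
}

# Toy example: precision = 2/3, recall = 1, so F1 = 0.8.
f1_score(y = c(1, 1, 0, 0), p_hat = c(0.9, 0.6, 0.7, 0.1), cutoff = 0.5)
```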

Let’s see how to do this in each of our softwares!

SAS

PROC LOGISTIC produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. The ctable option in the MODEL statement creates the classification table for us. The pprob option defines the cut-off (or range of cut-offs) for this classification table. We want to look at all classification tables for cut-offs ranging from a predicted probability of 0 to 0.98 by steps of 0.02. Next, we use a DATA STEP to take the classification table from PROC LOGISTIC and calculate the F1 Score for each of the cut-offs from the given precision and recall provided by SAS. From there we use PROC SORT to rank these by this F1 variable and print the first 5 observations with PROC PRINT.

ods html select Classification;
proc logistic data=logistic.lowbwt plots(only)=(oddsratio);
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke / ctable pprob = 0 to 0.98 by 0.02;
    ods output classification=classtable;
    title 'Modeling Low Birth Weight';
run;
quit;

data classtable;
    set classtable;
    F1 = 2*(PPV*Sensitivity)/(PPV+Sensitivity);
    drop Specificity NPV Correct;
run;

proc sort data=classtable;
    by descending F1;
run;

proc print data=classtable (obs = 5);
run;
Modeling Low Birth Weight

Classification Table
Prob Level | Correct Event | Correct Non-Event | Incorrect Event | Incorrect Non-Event | % Correct | Sensitivity | Specificity | Pos Pred | Neg Pred
0.000 59 0 130 0 31.2 100.0 0.0 31.2 .
0.020 59 0 130 0 31.2 100.0 0.0 31.2 .
0.040 59 0 130 0 31.2 100.0 0.0 31.2 .
0.060 59 1 129 0 31.7 100.0 0.8 31.4 100.0
0.080 59 4 126 0 33.3 100.0 3.1 31.9 100.0
0.100 59 10 120 0 36.5 100.0 7.7 33.0 100.0
0.120 58 15 115 1 38.6 98.3 11.5 33.5 93.8
0.140 57 26 104 2 43.9 96.6 20.0 35.4 92.9
0.160 54 35 95 5 47.1 91.5 26.9 36.2 87.5
0.180 53 43 87 6 50.8 89.8 33.1 37.9 87.8
0.200 53 45 85 6 51.9 89.8 34.6 38.4 88.2
0.220 51 47 83 8 51.9 86.4 36.2 38.1 85.5
0.240 48 51 79 11 52.4 81.4 39.2 37.8 82.3
0.260 47 53 77 12 52.9 79.7 40.8 37.9 81.5
0.280 45 56 74 14 53.4 76.3 43.1 37.8 80.0
0.300 43 62 68 16 55.6 72.9 47.7 38.7 79.5
0.320 38 69 61 21 56.6 64.4 53.1 38.4 76.7
0.340 33 82 48 26 60.8 55.9 63.1 40.7 75.9
0.360 26 90 40 33 61.4 44.1 69.2 39.4 73.2
0.380 19 98 32 40 61.9 32.2 75.4 37.3 71.0
0.400 14 105 25 45 63.0 23.7 80.8 35.9 70.0
0.420 9 111 19 50 63.5 15.3 85.4 32.1 68.9
0.440 9 113 17 50 64.6 15.3 86.9 34.6 69.3
0.460 9 119 11 50 67.7 15.3 91.5 45.0 70.4
0.480 9 119 11 50 67.7 15.3 91.5 45.0 70.4
0.500 9 120 10 50 68.3 15.3 92.3 47.4 70.6
0.520 9 122 8 50 69.3 15.3 93.8 52.9 70.9
0.540 8 123 7 51 69.3 13.6 94.6 53.3 70.7
0.560 8 123 7 51 69.3 13.6 94.6 53.3 70.7
0.580 7 124 6 52 69.3 11.9 95.4 53.8 70.5
0.600 7 124 6 52 69.3 11.9 95.4 53.8 70.5
0.620 7 125 5 52 69.8 11.9 96.2 58.3 70.6
0.640 4 127 3 55 69.3 6.8 97.7 57.1 69.8
0.660 3 127 3 56 68.8 5.1 97.7 50.0 69.4
0.680 2 127 3 57 68.3 3.4 97.7 40.0 69.0
0.700 0 129 1 59 68.3 0.0 99.2 0.0 68.6
0.720 0 130 0 59 68.8 0.0 100.0 . 68.8
0.740 0 130 0 59 68.8 0.0 100.0 . 68.8
0.760 0 130 0 59 68.8 0.0 100.0 . 68.8
0.780 0 130 0 59 68.8 0.0 100.0 . 68.8
0.800 0 130 0 59 68.8 0.0 100.0 . 68.8
0.820 0 130 0 59 68.8 0.0 100.0 . 68.8
0.840 0 130 0 59 68.8 0.0 100.0 . 68.8
0.860 0 130 0 59 68.8 0.0 100.0 . 68.8
0.880 0 130 0 59 68.8 0.0 100.0 . 68.8
0.900 0 130 0 59 68.8 0.0 100.0 . 68.8
0.920 0 130 0 59 68.8 0.0 100.0 . 68.8
0.940 0 130 0 59 68.8 0.0 100.0 . 68.8
0.960 0 130 0 59 68.8 0.0 100.0 . 68.8
0.980 0 130 0 59 68.8 0.0 100.0 . 68.8



Modeling Low Birth Weight

Obs ProbLevel CorrectEvents CorrectNonevents IncorrectEvents IncorrectNonevents Sensitivity FalsePositive FalseNegative PPV F1
1 0.200 53 45 85 6 89.8 61.6 11.8 38.4 53.8071
2 0.180 53 43 87 6 89.8 62.1 12.2 37.9 53.2663
3 0.220 51 47 83 8 86.4 61.9 14.5 38.1 52.8497
4 0.160 54 35 95 5 91.5 63.8 12.5 36.2 51.9231
5 0.140 57 26 104 2 96.6 64.6 7.1 35.4 51.8182

We can see that the highest F1 score had a value of 53.8071. This took place at a cut-off of 0.2. Therefore, according to the F1 score at least, the optimal cut-off for our model is 0.2. In other words, if our model predicts a probability above 0.2 then we should call this an event. Any predicted probability below 0.2 should be called a non-event.

Another common calculation using the precision of a model is the model’s lift. The lift of a model is simply calculated as the ratio of Precision to the population proportion of the event - \(Lift = PPV/\pi_1\). The interpretation of lift is really nice for explanation. Let’s imagine that your lift was 3 and your population proportion of events was 0.2. This means that in the top 20% of your customers, your model predicted 3 times the events as compared to you selecting people at random. Sometimes people plot lift charts where they plot the precision at all the different values of the population proportion (called depth).

Similar to before, we need SAS to calculate precision and recall at different cut-off values. Here we use the SCORE statement with the fitstat and outroc=roc options. This produces the cut-offs and measures SAS uses to calculate the ROC curve, and from these calculations we can derive the lift chart. Using a DATA STEP we calculate the needed variables: cutoff, depth, and precision. From there we calculate lift from the precision as well as the population proportion of events in our low birth weight data set - here 0.3122. We then use PROC SGPLOT to plot these.

proc logistic data=logistic.lowbwt plots(only)=(oddsratio) noprint;
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke / clodds=pl clparm=pl;
    score data=logistic.lowbwt fitstat outroc=roc;
    title 'Modeling Low Birth Weight';
run;
quit;

data work.roc; 
    set work.roc; 
    cutoff = _PROB_; 
    specif = 1-_1MSPEC_; 
    depth=(_POS_+_FALPOS_)/189*100; 
    precision=_POS_/(_POS_+_FALPOS_); 
    acc=_POS_+_NEG_; 
    lift=precision/0.3122; 
run;

proc sgplot data=work.roc; 
    *where 0.005 <= depth <= 0.50; 
    series y=lift x=depth; 
    refline 1.0 / axis=y; 
    title1 "Lift Chart for Training Data"; 
    xaxis label="Depth (%)";
    yaxis label="Lift";
run; 
quit;

Lift Chart for Training Data (PROC SGPLOT output)


R

R produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. The confusionMatrix function creates the classification table for us. The threshold option defines the cut-off for this classification table.

We want to look at all classification tables for cut-offs ranging from a predicted probability of 0 to 0.98 by steps of 0.02. We can easily loop through this calculation with a for loop in R. In this loop we calculate the sensitivity and precision at each threshold using the sensitivity and precision functions. We also calculate the F1 score for each of the cut-offs from the sensitivity/recall and precision provided by R. From there we combine these variables into a single data frame and print the observations with the print function.
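The loop itself isn't shown here, but a minimal base-R sketch of the idea (using a small made-up vector of actuals and predicted probabilities rather than the fitted birth weight model, and hand-rolled recall/precision in place of the sensitivity and precision functions) looks like this:

```r
# Illustrative data only - in practice these come from the fitted model
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
phat   <- c(0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.8, 0.3)

cutoffs <- seq(0.02, 0.98, by = 0.02)
reca <- prec <- f1 <- numeric(length(cutoffs))

for (i in seq_along(cutoffs)) {
  pred <- as.numeric(phat > cutoffs[i])
  tp <- sum(pred == 1 & actual == 1)   # true positives
  fp <- sum(pred == 1 & actual == 0)   # false positives
  fn <- sum(pred == 0 & actual == 1)   # false negatives
  reca[i] <- tp / (tp + fn)            # recall / sensitivity
  prec[i] <- tp / (tp + fp)            # precision / PPV (NaN if nothing is predicted)
  f1[i]   <- 2 * prec[i] * reca[i] / (prec[i] + reca[i])
}

results <- data.frame(cutoff = cutoffs, reca, prec, f1)
results[which.max(results$f1), ]       # cut-off with the highest F1 score
```

Note that which.max skips the NaN rows created at high cut-offs where no events are predicted.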

   cutoff       reca      prec         f1
1    0.02 1.00000000 0.3121693 0.47580645
2    0.04 1.00000000 0.3121693 0.47580645
3    0.06 1.00000000 0.3138298 0.47773279
4    0.08 1.00000000 0.3189189 0.48360656
5    0.10 1.00000000 0.3314607 0.49789030
6    0.12 1.00000000 0.3410405 0.50862069
7    0.14 0.96610169 0.3630573 0.52777778
8    0.16 0.96610169 0.3800000 0.54545455
9    0.18 0.93220339 0.3873239 0.54726368
10   0.20 0.89830508 0.3868613 0.54081633
11   0.22 0.89830508 0.3925926 0.54639175
12   0.24 0.86440678 0.3953488 0.54255319
13   0.26 0.81355932 0.3870968 0.52459016
14   0.28 0.79661017 0.3884298 0.52222222
15   0.30 0.74576271 0.3963964 0.51764706
16   0.32 0.67796610 0.4081633 0.50955414
17   0.34 0.61016949 0.4500000 0.51798561
18   0.36 0.54237288 0.4705882 0.50393701
19   0.38 0.38983051 0.4509804 0.41818182
20   0.40 0.28813559 0.4146341 0.34000000
21   0.42 0.27118644 0.4571429 0.34042553
22   0.44 0.18644068 0.5000000 0.27160494
23   0.46 0.15254237 0.4736842 0.23076923
24   0.48 0.15254237 0.5000000 0.23376623
25   0.50 0.15254237 0.5294118 0.23684211
26   0.52 0.15254237 0.5625000 0.24000000
27   0.54 0.15254237 0.6000000 0.24324324
28   0.56 0.13559322 0.5714286 0.21917808
29   0.58 0.13559322 0.5714286 0.21917808
30   0.60 0.11864407 0.6363636 0.20000000
31   0.62 0.11864407 0.7000000 0.20289855
32   0.64 0.11864407 0.7000000 0.20289855
33   0.66 0.06779661 0.8000000 0.12500000
34   0.68 0.05084746 0.7500000 0.09523810
35   0.70 0.03389831 1.0000000 0.06557377
36   0.72 0.00000000       NaN        NaN
37   0.74 0.00000000       NaN        NaN
38   0.76 0.00000000       NaN        NaN
39   0.78 0.00000000       NaN        NaN
40   0.80 0.00000000       NaN        NaN
41   0.82 0.00000000       NaN        NaN
42   0.84 0.00000000       NaN        NaN
43   0.86 0.00000000       NaN        NaN
44   0.88 0.00000000       NaN        NaN
45   0.90 0.00000000       NaN        NaN
46   0.92 0.00000000       NaN        NaN
47   0.94 0.00000000       NaN        NaN
48   0.96 0.00000000       NaN        NaN
49   0.98 0.00000000       NaN        NaN

We can see that the highest F1 score had a value of 0.5473. This took place at a cut-off of 0.18. Therefore, according to the F1 score at least, the optimal cut-off for our model is 0.18. In other words, if our model predicts a probability above 0.18 then we should call this an event. Any predicted probability below 0.18 should be called a non-event.

Another common calculation using the precision of a model is the model’s lift. The lift of a model is simply calculated as the ratio of Precision to the population proportion of the event - \(Lift = PPV/\pi_1\). The interpretation of lift is really nice for explanation. Let’s imagine that your lift was 3 and your population proportion of events was 0.2. This means that in the top 20% of your customers, your model predicted 3 times the events as compared to you selecting people at random. Sometimes people plot lift charts where they plot the precision at all the different values of the population proportion (called depth).

The performance function in R can easily calculate and plot the lift chart for us using the measure = "lift" and x.measure = "rpp" options. This plots the lift vs. the rate of positive predictions.
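The underlying calculation is easy to sketch by hand as well. Assuming toy vectors of actuals and predicted probabilities (not the birth weight model), precision at every depth and the resulting lift can be computed like this:

```r
# Illustrative data only
actual <- c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0)
phat   <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)

ord <- order(phat, decreasing = TRUE)              # rank observations by predicted probability
pi1 <- mean(actual)                                # population proportion of events

depth     <- seq_along(ord) / length(ord)          # rate of positive predictions
precision <- cumsum(actual[ord]) / seq_along(ord)  # PPV within the top "depth" of predictions
lift      <- precision / pi1                       # Lift = PPV / pi1

plot(depth, lift, type = "l", xlab = "Depth", ylab = "Lift",
     main = "Lift Chart")
abline(h = 1)                                      # lift of 1 = no better than random
```

At full depth the precision collapses to the event proportion itself, so the curve always ends at a lift of 1.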

K-S Statistic

One of the most popular metrics for classification models in the finance and banking industry is the KS statistic. The two-sample KS statistic can determine if there is a difference between two cumulative distribution functions. The two cumulative distribution functions of interest to us are the predicted probability distributions for the event and non-event target groups. The KS \(D\) statistic is the maximum distance between these two curves - calculated as the maximum difference between the true positive and false positive rates, \(D = \max_{depth}{(TPR - FPR)}\).

The optimal cut-off for determining predicted events and non-events would be at the point where this is maximized.

Let’s see how to do this in each of our softwares!

SAS

The KS statistic is not limited to logistic regression, as it can estimate the difference between any two cumulative distribution functions. Therefore, we will need an additional procedure beyond PROC LOGISTIC. First, we obtain the predicted probabilities for our training set through the OUTPUT statement of PROC LOGISTIC. We then input this data set of predictions into PROC NPAR1WAY. With the d option we can calculate the KS D statistic. The plot = edfplot option allows us to see the cumulative distributions being compared. The CLASS statement defines the variable that specifies our two groups - here the target variable low. The VAR statement defines the distribution calculation variable - here the predicted probabilities phat.

ods html select ParameterEstimates;
proc logistic data=logistic.lowbwt;
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke / clodds=pl clparm=pl;
    output out=predprobs p=phat;
run;

ods html select KSTest KS2Stats EDFPlot;
proc npar1way data=predprobs d plot=edfplot;
    class low;
    var phat;
run;
Analysis of Maximum Likelihood Estimates
Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq
Intercept 1 -0.1093 0.8821 0.0153 0.9014
race black 1 1.2900 0.5109 6.3766 0.0116
race other 1 0.9705 0.4122 5.5423 0.0186
lwt 1 -0.0133 0.00631 4.4149 0.0356
smoke 1 1.0600 0.3783 7.8500 0.0051



Kolmogorov-Smirnov Test for Variable phat
Classified by Variable low
low | N | EDF at Maximum | Deviation from Mean at Maximum
0 130 0.646154 1.032979
1 59 0.355932 -1.533336
Total 189 0.555556
Maximum Deviation Occurred at Observation 27
Value of phat at Maximum = 0.339202

Kolmogorov-Smirnov Two-Sample Test (Asymptotic)
D = max |F1 - F2| 0.2902
Pr > D 0.0021
D+ = max (F1 - F2) 0.2902
Pr > D+ 0.0011
D- = max (F2 - F1) 0.0000
Pr > D- 1.0000

Empirical Distribution Function Plot for phat Classified by low


From the output we can see the KS D statistic at 0.2902. The predicted probability that this occurs at (the optimal cut-off) is defined at 0.339202. In other words, if our model predicts a probability above 0.34 then we should call this an event. Any predicted probability below 0.34 should be called a non-event, according to the KS statistic.

R

The performance function in R again helps us calculate the needed metric. Using the measure = "tpr" and x.measure = "fpr" options, we can calculate the true positive rate and false positive rate across all our predictions. From there we can just use the max function to find the maximum difference between the two - the KS statistic. Finding the cut-off at this point is a little trickier with some of the needed R functions, but we essentially search the alpha values (here the cut-offs) for the point where the KS statistic is maximized.
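A base-R sketch of that search (again on made-up data rather than the ROCR objects) might look like:

```r
# Illustrative data only
actual <- c(1, 1, 0, 1, 0, 0, 0, 1, 0, 0)
phat   <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)

cuts <- sort(unique(phat), decreasing = TRUE)                   # candidate cut-offs (the alpha values)
tpr  <- sapply(cuts, function(a) mean(phat[actual == 1] >= a))  # true positive rate at each cut-off
fpr  <- sapply(cuts, function(a) mean(phat[actual == 0] >= a))  # false positive rate at each cut-off

ks     <- max(tpr - fpr)                                        # KS D statistic
cutoff <- cuts[which.max(tpr - fpr)]                            # cut-off where D is attained
c(ks, cutoff)
```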

[1] 0.2902216 0.3399442

From the output we can see the KS D statistic at 0.2902. The predicted probability that this occurs at (the optimal cut-off) is 0.3399. In other words, if our model predicts a probability above 0.34 then we should call this an event. Any predicted probability below 0.34 should be called a non-event, according to the KS statistic.

Accuracy & Error

Accuracy and error rate are the metrics people typically think of first when measuring the ability of a logistic regression model. Accuracy is simply the percentage of events and non-events that were predicted correctly.

\[ Accuracy = \frac{True\ Positives + True\ Negatives}{Total\ Observations} \]

The error would be the opposite of this.

\[ Error = 1 - Accuracy \]

However, caution should be used with these metrics as they can easily be fooled if only focusing on them. If your data has 10% events and 90% non-events, you can easily have a 90% accurate model by guessing non-events for every observation. Instead, having less accuracy might be better if we can predict both events and non-events.

Let’s see how to do this in each of our softwares!

SAS

PROC LOGISTIC produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. The ctable option in the MODEL statement creates the classification table for us. The pprob option defines the cut-off (or range of cut-offs) for this classification table. We want to look at all classification tables for cut-offs ranging from a predicted probability of 0 to 0.98 by steps of 0.02. From there we use PROC SORT to rank these by the Correct variable produced in PROC LOGISTIC and print the first 5 observations with PROC PRINT.

ods html select Classification;
proc logistic data=logistic.lowbwt plots(only)=(oddsratio);
    class race(ref='white') / param=ref; 
    model low(event='1') = race lwt smoke / ctable pprob = 0 to 0.98 by 0.02;
    ods output classification=classtable;
    title 'Modeling Low Birth Weight';
run;
quit;

proc sort data=classtable;
    by descending Correct;
run;

proc print data=classtable (obs = 5);
run;
Modeling Low Birth Weight

Classification Table
Prob Level | Correct Event | Correct Non-Event | Incorrect Event | Incorrect Non-Event | % Correct | Sensitivity | Specificity | Pos Pred | Neg Pred
0.000 59 0 130 0 31.2 100.0 0.0 31.2 .
0.020 59 0 130 0 31.2 100.0 0.0 31.2 .
0.040 59 0 130 0 31.2 100.0 0.0 31.2 .
0.060 59 1 129 0 31.7 100.0 0.8 31.4 100.0
0.080 59 4 126 0 33.3 100.0 3.1 31.9 100.0
0.100 59 10 120 0 36.5 100.0 7.7 33.0 100.0
0.120 58 15 115 1 38.6 98.3 11.5 33.5 93.8
0.140 57 26 104 2 43.9 96.6 20.0 35.4 92.9
0.160 54 35 95 5 47.1 91.5 26.9 36.2 87.5
0.180 53 43 87 6 50.8 89.8 33.1 37.9 87.8
0.200 53 45 85 6 51.9 89.8 34.6 38.4 88.2
0.220 51 47 83 8 51.9 86.4 36.2 38.1 85.5
0.240 48 51 79 11 52.4 81.4 39.2 37.8 82.3
0.260 47 53 77 12 52.9 79.7 40.8 37.9 81.5
0.280 45 56 74 14 53.4 76.3 43.1 37.8 80.0
0.300 43 62 68 16 55.6 72.9 47.7 38.7 79.5
0.320 38 69 61 21 56.6 64.4 53.1 38.4 76.7
0.340 33 82 48 26 60.8 55.9 63.1 40.7 75.9
0.360 26 90 40 33 61.4 44.1 69.2 39.4 73.2
0.380 19 98 32 40 61.9 32.2 75.4 37.3 71.0
0.400 14 105 25 45 63.0 23.7 80.8 35.9 70.0
0.420 9 111 19 50 63.5 15.3 85.4 32.1 68.9
0.440 9 113 17 50 64.6 15.3 86.9 34.6 69.3
0.460 9 119 11 50 67.7 15.3 91.5 45.0 70.4
0.480 9 119 11 50 67.7 15.3 91.5 45.0 70.4
0.500 9 120 10 50 68.3 15.3 92.3 47.4 70.6
0.520 9 122 8 50 69.3 15.3 93.8 52.9 70.9
0.540 8 123 7 51 69.3 13.6 94.6 53.3 70.7
0.560 8 123 7 51 69.3 13.6 94.6 53.3 70.7
0.580 7 124 6 52 69.3 11.9 95.4 53.8 70.5
0.600 7 124 6 52 69.3 11.9 95.4 53.8 70.5
0.620 7 125 5 52 69.8 11.9 96.2 58.3 70.6
0.640 4 127 3 55 69.3 6.8 97.7 57.1 69.8
0.660 3 127 3 56 68.8 5.1 97.7 50.0 69.4
0.680 2 127 3 57 68.3 3.4 97.7 40.0 69.0
0.700 0 129 1 59 68.3 0.0 99.2 0.0 68.6
0.720 0 130 0 59 68.8 0.0 100.0 . 68.8
0.740 0 130 0 59 68.8 0.0 100.0 . 68.8
0.760 0 130 0 59 68.8 0.0 100.0 . 68.8
0.780 0 130 0 59 68.8 0.0 100.0 . 68.8
0.800 0 130 0 59 68.8 0.0 100.0 . 68.8
0.820 0 130 0 59 68.8 0.0 100.0 . 68.8
0.840 0 130 0 59 68.8 0.0 100.0 . 68.8
0.860 0 130 0 59 68.8 0.0 100.0 . 68.8
0.880 0 130 0 59 68.8 0.0 100.0 . 68.8
0.900 0 130 0 59 68.8 0.0 100.0 . 68.8
0.920 0 130 0 59 68.8 0.0 100.0 . 68.8
0.940 0 130 0 59 68.8 0.0 100.0 . 68.8
0.960 0 130 0 59 68.8 0.0 100.0 . 68.8
0.980 0 130 0 59 68.8 0.0 100.0 . 68.8



Modeling Low Birth Weight

Obs ProbLevel CorrectEvents CorrectNonevents IncorrectEvents IncorrectNonevents Correct Sensitivity Specificity FalsePositive FalseNegative PPV NPV
1 0.620 7 125 5 52 69.8 11.9 96.2 41.7 29.4 58.3 70.6
2 0.520 9 122 8 50 69.3 15.3 93.8 47.1 29.1 52.9 70.9
3 0.540 8 123 7 51 69.3 13.6 94.6 46.7 29.3 53.3 70.7
4 0.560 8 123 7 51 69.3 13.6 94.6 46.7 29.3 53.3 70.7
5 0.580 7 124 6 52 69.3 11.9 95.4 46.2 29.5 53.8 70.5

From the output we can see the accuracy is maximized at 69.8%. The predicted probability that this occurs at (the optimal cut-off) is defined at 0.62. In other words, if our model predicts a probability above 0.62 then we should call this an event. Any predicted probability below 0.62 should be called a non-event, according to the accuracy metric.

There is more to model building than simply maximizing overall classification accuracy. These are good numbers to report, but not necessarily to choose models on.

R

R produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. The confusionMatrix function creates the classification table for us. The threshold option defines the cut-off for this classification table.

We want to look at all classification tables for cut-offs ranging from a predicted probability of 0 to 0.98 by steps of 0.02. We can easily loop through this calculation with a for loop in R. In this loop we calculate the error and accuracy at each threshold using the misClassError function. From there we combine these variables into a single data frame and print the observations with the print function.
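The bookkeeping behind that loop is simple enough to sketch in base R (on made-up data, not the birth weight model):

```r
# Illustrative data only
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
phat   <- c(0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.8, 0.3)

cutoffs  <- seq(0.02, 0.98, by = 0.02)
accuracy <- sapply(cutoffs, function(a) mean((phat > a) == actual))  # proportion classified correctly
error    <- 1 - accuracy

data.frame(cutoff = cutoffs, error, accuracy)
cutoffs[which.max(accuracy)]   # cut-off maximizing accuracy
```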

   cutoff  error accuracy
1    0.02 0.6878   0.3122
2    0.04 0.6878   0.3122
3    0.06 0.6825   0.3175
4    0.08 0.6667   0.3333
5    0.10 0.6296   0.3704
6    0.12 0.6032   0.3968
7    0.14 0.5397   0.4603
8    0.16 0.5026   0.4974
9    0.18 0.4815   0.5185
10   0.20 0.4762   0.5238
11   0.22 0.4656   0.5344
12   0.24 0.4550   0.5450
13   0.26 0.4603   0.5397
14   0.28 0.4550   0.5450
15   0.30 0.4339   0.5661
16   0.32 0.4074   0.5926
17   0.34 0.3545   0.6455
18   0.36 0.3333   0.6667
19   0.38 0.3386   0.6614
20   0.40 0.3492   0.6508
21   0.42 0.3280   0.6720
22   0.44 0.3122   0.6878
23   0.46 0.3175   0.6825
24   0.48 0.3122   0.6878
25   0.50 0.3069   0.6931
26   0.52 0.3016   0.6984
27   0.54 0.2963   0.7037
28   0.56 0.3016   0.6984
29   0.58 0.3016   0.6984
30   0.60 0.2963   0.7037
31   0.62 0.2910   0.7090
32   0.64 0.2910   0.7090
33   0.66 0.2963   0.7037
34   0.68 0.3016   0.6984
35   0.70 0.3016   0.6984
36   0.72 0.3122   0.6878
37   0.74 0.3122   0.6878
38   0.76 0.3122   0.6878
39   0.78 0.3122   0.6878
40   0.80 0.3122   0.6878
41   0.82 0.3122   0.6878
42   0.84 0.3122   0.6878
43   0.86 0.3122   0.6878
44   0.88 0.3122   0.6878
45   0.90 0.3122   0.6878
46   0.92 0.3122   0.6878
47   0.94 0.3122   0.6878
48   0.96 0.3122   0.6878
49   0.98 0.3122   0.6878

From the output we can see the accuracy is maximized at 70.9%. The predicted probability that this occurs at (the optimal cut-off) is 0.62. In other words, if our model predicts a probability above 0.62 then we should call this an event. Any predicted probability below 0.62 should be called a non-event, according to the accuracy metric.

There is more to model building than simply maximizing overall classification accuracy. These are good numbers to report, but not necessarily to choose models on.

Optimal Cut-off

Classification is a decision that is extraneous to statistical modeling. Although logistic regressions tend to work well in classification, it is a probability model and does not output events and non-events.

Classification assumes cost for each observation is the same, which is rarely true. It is always better to consider the costs of false positives and false negatives when considering cut-offs in classification. The previous sections talk about many ways to determine “optimal” cut-offs when costs are either not known or not necessary. However, if costs are known, they should drive the cut-off decision.
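When costs are available, picking the cut-off reduces to minimizing total expected cost. A small base-R sketch with hypothetical costs (here a false positive costing five times a false negative, purely for illustration) on made-up data:

```r
# Illustrative data and costs only
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)
phat   <- c(0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.8, 0.3)
cost_fp <- 5   # hypothetical cost of a false positive
cost_fn <- 1   # hypothetical cost of a false negative

cutoffs <- seq(0.02, 0.98, by = 0.02)
cost <- sapply(cutoffs, function(a) {
  pred <- phat > a
  cost_fp * sum(pred & actual == 0) +    # false positives at this cut-off
  cost_fn * sum(!pred & actual == 1)     # false negatives at this cut-off
})

cutoffs[which.min(cost)]   # cut-off minimizing total cost
```

With expensive false positives the cost-minimizing cut-off shifts higher than the one F1 or accuracy alone would pick; flipping the costs shifts it the other way.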

Ordinal Logistic Regression

Logistic regression is not limited to binary categorical target variables. It works with target variables that have any number of categories. If these categories have an inherent order to them, we perform ordinal logistic regression. Binary logistic regression is actually a special case of ordinal logistic regression, since a binary variable is trivially ordered. If the categories have no inherent order to them, we perform nominal (or multinomial) logistic regression. Here we will discuss ordinal logistic regression.

Ordinal logistic regression models can also be used to model continuous target variables that have bounds on them; however, this will not be covered here.

Ordinal logistic regression models are generalizations of binary logistic regression models. In binary logistic regression we calculate the probability that an observation has an event. In ordinal logistic regression we calculate the probability that an observation falls at or below a given category in an ordered list of outcomes.

We will be using the wallet data set to model the association between various factors and different levels of ethical responses to finding a wallet - returning the wallet and everything in it, returning the wallet but keeping the money found in it, keeping both the wallet and the money found inside. The variables in the data set are the following:

Returning Wallet Data Dictionary
Variable Description
male indicator for a male student
business indicator for student enrolled in business school
punish how often the student was punished as a child (1 - low, 2 - moderate, 3 - high)
explain indicator for whether explanation for punishment was given

The most common and easiest to interpret approach for ordinal logistic regression is the cumulative logit approach in the proportional odds model.

Proportional Odds Model

Instead of modeling the typical logit from binary logistic regression, we will model the cumulative logits. These cumulative logits are built off target variables with \(m\) categories. The first logit will summarize the first category compared to the rest. The second logit will summarize the first two categories compared to the rest. This continues \(m-1\) times as the last logit compares the first \(m-1\) categories to the \(m^{th}\) category. In essence, with a target variable with \(m\) categories, we are building \(m-1\) logistic regressions. This is why binary logistic regression is a special case of the ordinal logistic regression model - 2 categories in the target variable leads to one logistic regression model.

The main assumption of the proportional odds model is that although the intercept changes across the \(m-1\) models, the slope parameters on each variable stay the same. This will make the effects proportional across the different logistic regressions.
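Written out, for a target with \(m\) ordered categories the proportional odds model is:

\[ \log\left(\frac{P(Y \le j)}{1 - P(Y \le j)}\right) = \alpha_j + \beta_1 x_1 + \dots + \beta_k x_k, \quad j = 1, \dots, m-1 \]

Only the intercepts \(\alpha_j\) differ across the \(m-1\) logits; the slopes \(\beta_1, \dots, \beta_k\) are shared, which is exactly the proportionality assumption.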

Let’s see how to build these proportional odds models in each of our softwares!

SAS

PROC LOGISTIC is actually made to calculate ordinal logistic regression models as the binary logistic is just a special case. Notice how the code is the exact same, save one big omission. There is no event = option after the target variable since the target is just assumed to be built ordinally.

proc logistic data=Logistic.Wallet;
    class punish(param=ref ref='1');
    model wallet = male business punish explain / clodds=pl;
    title 'Ordinal Logistic Regression - Ascending';
run;
quit;
Ordinal Logistic Regression - Ascending

Model Information
Data Set LOGISTIC.WALLET
Response Variable wallet
Number of Response Levels 3
Model cumulative logit
Optimization Technique Fisher's scoring

Number of Observations Read 195
Number of Observations Used 195

Response Profile
Ordered Value | wallet | Total Frequency
1 1 24
2 2 50
3 3 121

Probabilities modeled are cumulated over the lower Ordered Values.


Class Level Information
Class Value Design Variables
punish 1 0 0
2 1 0
3 0 1

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Score Test for the Proportional Odds Assumption
Chi-Square DF Pr > ChiSq
5.2215 5 0.3894

Model Fit Statistics
Criterion | Intercept Only | Intercept and Covariates
AIC 356.140 321.335
SC 362.686 344.246
-2 Log L 352.140 307.335

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 44.8047 5 <.0001
Score 40.9509 5 <.0001
Wald 38.5978 5 <.0001

Type 3 Analysis of Effects
Effect | DF | Wald Chi-Square | Pr > ChiSq
male 1 10.6047 0.0011
business 1 4.4167 0.0356
punish 2 9.4185 0.0090
explain 1 9.4925 0.0021

Analysis of Maximum Likelihood Estimates
Parameter | DF | Estimate | Standard Error | Wald Chi-Square | Pr > ChiSq
Intercept 1 1 -2.5678 0.4169 37.9321 <.0001
Intercept 2 1 -0.7890 0.3675 4.6107 0.0318
male 1 1.0598 0.3254 10.6047 0.0011
business 1 0.7389 0.3516 4.4167 0.0356
punish 2 1 0.6277 0.4005 2.4564 0.1170
punish 3 1 1.4031 0.4721 8.8330 0.0030
explain 1 -1.0518 0.3414 9.4925 0.0021

Association of Predicted Probabilities and Observed Responses
Percent Concordant 69.0 Somers' D 0.465
Percent Discordant 22.5 Gamma 0.508
Percent Tied 8.5 Tau-a 0.249
Pairs 10154 c 0.732

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect Unit Estimate 95% Confidence Limits
male 1.0000 2.886 1.533 5.557
business 1.0000 2.094 1.039 4.206
punish 2 vs 1 1.0000 1.873 0.838 4.124
punish 3 vs 1 1.0000 4.068 1.583 10.616
explain 1.0000 0.349 0.178 0.681

Plot of Odds Ratios with 95% Profile-Likelihood Confidence Limits


In the output above we see similar things to the output we have seen previously. PROC LOGISTIC summarizes the model used, the number of observations read and used in the model, and builds a frequency table for our target variable. It then summarizes the categorical variable created in the CLASS statement followed by telling us that the model converged. It also provides us with the same global tests, type 3 analysis of effects, parameter estimates, association metrics, and odds ratios we have discussed in previous sections. The main difference in the parameter estimates is the multiple intercepts. Since we have three levels of our target variable we have two logistic regression models and therefore two intercepts.

The main question from here is whether the assumption of proportional odds actually holds. By default SAS provides us this answer with the Score test for the proportional odds assumption. The output is again highlighted below.

ods html select CumulativeModelTest;
proc logistic data=Logistic.Wallet;
    class punish(param=ref ref='1');
    model wallet = male business punish explain / clodds=pl;
    title 'Ordinal Logistic Regression - Ascending';
run;
quit;
Ordinal Logistic Regression - Ascending

Score Test for the Proportional Odds Assumption
Chi-Square DF Pr > ChiSq
5.2215 5 0.3894

The null hypothesis for the above test is that the assumption holds - the slope parameters for each variable are equal across the separate logistic regression models. Since our p-value of 0.3894 is above any reasonable significance level for this sample size, we have no evidence to say the assumption doesn't hold.

The Score test is much like a global test across all the variables to see if the assumption holds. However, what if this test tells us the assumption doesn't hold? Which variable(s) does it break on? Again, we go to the TEST statement in PROC LOGISTIC to answer this. We can test the same variable across different models by using an "_" followed by the model number. For example, if we want to know whether the business variable is the same across models 1 and 2, we compare business_1 and business_2. This is given in the second TEST statement below.

ods html select TestStmts;
proc logistic data=Logistic.Wallet;
    class punish(param=ref ref='1');
    model wallet = male business punish explain / unequalslopes clodds=pl;
    male: test male_1 = male_2;
    business: test business_1 = business_2;
    punish2: test punish2_1 = punish2_2;
    punish3: test punish3_1 = punish3_2;
    explain: test explain_1 = explain_2;
    title 'Ordinal Logistic Regression - Unequal Slopes';
run;
quit;
Ordinal Logistic Regression - Unequal Slopes

Linear Hypotheses Testing Results
Label | Wald Chi-Square | DF | Pr > ChiSq
male 0.7582 1 0.3839
business 0.9922 1 0.3192
punish2 0.7561 1 0.3845
punish3 3.0150 1 0.0825
explain 0.2850 1 0.5935


The results from the individual TEST statements agree with the global test, that there is no evidence to support there being different slope parameters for the same variables across different models. This is not surprising. These individual tests should only be run if the global test says the assumption doesn’t hold. They were performed here only for completeness.

R

In R, we need to use the polr function instead of the glm function to perform ordinal logistic regression models. The formula piece of the function is much the same as with the glm function. The method = option specifies that we are performing a logistic regression while the data = option specifies the data frame being used.
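The chunk that produced the output below isn't shown; it is the polr call visible in the Call: line. As a self-contained sketch (on simulated ordinal data rather than the wallet training set, which is assumed to live in a data frame named train):

```r
library(MASS)  # provides polr

# The call used for the wallet data (assumes `train` is loaded):
# fit <- polr(factor(wallet) ~ male + business + punish + explain,
#             data = train, method = "logistic")

# Simulated stand-in so the sketch runs on its own:
set.seed(123)
x <- rnorm(300)
y <- cut(x + rlogis(300),                 # latent scale with logistic noise
         breaks = c(-Inf, -1, 1, Inf),
         labels = c("1", "2", "3"),
         ordered_result = TRUE)

fit <- polr(y ~ x, method = "logistic", Hess = TRUE)
summary(fit)   # two intercepts (1|2 and 2|3) for the three ordered levels
```

Passing Hess = TRUE stores the Hessian at fit time, which avoids the "Re-fitting to get Hessian" message that summary otherwise produces.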


Re-fitting to get Hessian
Call:
polr(formula = factor(wallet) ~ male + business + punish + explain, 
    data = train, method = "logistic")

Coefficients:
           Value Std. Error t value
male     -1.0598     0.3274  -3.237
business -0.7389     0.3556  -2.078
punish2  -0.6276     0.4048  -1.551
punish3  -1.4031     0.4823  -2.909
explain   1.0519     0.3408   3.086

Intercepts:
    Value   Std. Error t value
1|2 -2.5679  0.4190    -6.1287
2|3 -0.7890  0.3709    -2.1273

Residual Deviance: 307.3349 
AIC: 321.3349 

In the output above we see similar things to the output we have seen previously. R summarizes the model used as well as the parameter estimates of the model. The main difference in the parameter estimates is the multiple intercepts. Since we have three levels of our target variable we have two logistic regression models and therefore two intercepts.

The main question from here is whether the assumption of proportional odds actually holds. By default R doesn’t provide a global test to answer this. However, the Brant test in the brant function will test each of the variables in the separate logistic regression models to see if their slope parameters are the same.

-------------------------------------------- 
Test for    X2  df  probability 
-------------------------------------------- 
Omnibus     5.46    5   0.36
male        0.51    1   0.47
business    0.58    1   0.45
punish2     0.99    1   0.32
punish3     2.81    1   0.09
explain     0.25    1   0.62
-------------------------------------------- 

H0: Parallel Regression Assumption holds

The null hypothesis for each of the above tests is that the assumption holds - the slope parameters across the separate logistic regression models are equal. Since our p-values are above any reasonable significance level for this sample size, we have no evidence that the assumption fails for any of the variables.

Partial Proportional Odds Model

Although the proportional odds assumption doesn't fail for any of the variables in our data, what would we do if it did? The partial proportional odds model is a model where some (but not all) of the variables don't follow the proportional odds assumption. If none of the variables follow the proportional odds assumption, then we should build a nominal (or multinomial) logistic regression, summarized in the next section.

Let’s see how each of our softwares can build partial proportional odds models!

SAS

PROC LOGISTIC can easily perform the partial proportional odds model with one additional option added on - the unequalslopes = option. Here we just list which variables need to be built without the proportional odds assumption. Essentially, this means that those variables will have different slope parameters in the different models.

ods html select ParameterEstimates;
proc logistic data=Logistic.Wallet;
    class punish(param=ref ref='1');
    model wallet = male business punish explain / unequalslopes=(business) clodds=pl;
    title 'Partial Proportional Odds Model';
run;
quit;
Partial Proportional Odds Model

Analysis of Maximum Likelihood Estimates
Parameter wallet DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept 1 1 -2.6695 0.4491 35.3260 <.0001
Intercept 2 1 -0.7730 0.3714 4.3318 0.0374
male 1 1.0707 0.3282 10.6426 0.0011
business 1 1 0.9722 0.4838 4.0389 0.0445
business 2 1 0.6376 0.3810 2.7996 0.0943
punish 2 1 0.6300 0.4048 2.4224 0.1196
punish 3 1 1.3956 0.4829 8.3523 0.0039
explain 1 -1.0532 0.3405 9.5660 0.0020

The above output is the same as the proportional odds model output with one difference. There are two slope parameter estimates for the business variable - one for each model. The rest of the output is the same as before.

R

R uses the vglm function to perform the partial proportional odds model. The formula piece of the function is the same as previous functions used, as is the data = option. The main difference is the family = option that specifies the cumulative logit. In the cumulative option we have the parallel = F option followed by which variables we think the assumption doesn't hold for - namely the business variable.
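As a sketch (assuming the `VGAM` package is installed and the same `train` data frame as before), the call looks like the following; `parallel = FALSE ~ business` relaxes the proportional odds assumption for the business variable only:

```r
# Partial proportional odds model with vglm
library(VGAM)

fit_ppo <- vglm(factor(wallet) ~ male + business + punish + explain,
                family = cumulative(parallel = FALSE ~ business),
                data = train)
summary(fit_ppo)
```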


Call:
vglm(formula = factor(wallet) ~ male + business + punish + explain, 
    family = cumulative(parallel = F ~ business), data = train)

Pearson residuals:
                      Min      1Q  Median      3Q   Max
logitlink(P[Y<=1]) -1.241 -0.4759 -0.1765 -0.1099 3.899
logitlink(P[Y<=2]) -1.850 -0.6623 -0.3859  0.6425 2.730

Coefficients: 
              Estimate Std. Error z value Pr(>|z|)    
(Intercept):1  -2.6695     0.4466  -5.978 2.26e-09 ***
(Intercept):2  -0.7730     0.3678  -2.102  0.03557 *  
male            1.0707     0.3258   3.287  0.00101 ** 
business:1      0.9722     0.4789   2.030  0.04236 *  
business:2      0.6376     0.3810   1.674  0.09423 .  
punish2         0.6300     0.4008   1.572  0.11594    
punish3         1.3956     0.4727   2.952  0.00316 ** 
explain        -1.0532     0.3413  -3.086  0.00203 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])

Residual deviance: 306.8392 on 382 degrees of freedom

Log-likelihood: -153.4196 on 382 degrees of freedom

Number of Fisher scoring iterations: 5 

No Hauck-Donner effect found in any of the estimates


Exponentiated coefficients:
      male business:1 business:2    punish2    punish3    explain 
 2.9174165  2.6438481  1.8918696  1.8776731  4.0373205  0.3488076 

The above output is the same as the proportional odds model output with one difference. There are two slope parameter estimates for the business variable - one for each model. The rest of the output is the same as before.

Interpretation

To understand the interpretation of the odds ratios in ordinal logistic regression we need to remember how the logistic regression equations are structured. Instead of modeling the typical logit from binary logistic regression, we will model the cumulative logits. These cumulative logits are built off target variables with \(m\) categories. The first logit will summarize the first category compared to the rest. The second logit will summarize the first two categories compared to the rest. This continues \(m-1\) times as the last logit compares the first \(m-1\) categories to the \(m^{th}\) category. The question really becomes what is first. Is it the largest category or the smallest category? The answer to this question will influence the interpretation of the results. If the first category is the smallest category then the categories are ascending as they build up further logistic regression models. If the first category is the largest category then the categories are descending as they build up further logistic regression models.
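The structure just described can be written compactly. As a sketch (with \(\alpha_j\) for the intercepts and a shared slope vector \(\beta\), notation introduced here), the \(j^{th}\) cumulative logit under the proportional odds assumption is:

```latex
\ln\left(\frac{P(Y \le j)}{P(Y > j)}\right)
  = \alpha_j + \beta_1 x_1 + \cdots + \beta_k x_k,
  \qquad j = 1, \dots, m-1
```

Each of the \(m-1\) models shares the same slopes and differs only in its intercept \(\alpha_j\), which is why the output above shows multiple intercepts but a single set of slope estimates.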

Ascending

By default, PROC LOGISTIC in SAS builds ordinal logistic regression models in ascending order of categories. The lowest valued category is the starting point and each additional logistic regression adds the next lowest valued category - building from the bottom up. There are no additional options needed as this is the default in SAS.

ods html select CLOddsPL;
proc logistic data=Logistic.Wallet;
    class punish(param=ref ref='1');
    model wallet = male business punish explain / clodds=pl;
    title 'Ordinal Logistic Regression - Ascending';
run;
quit;
Ordinal Logistic Regression - Ascending

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect Unit Estimate 95% Confidence Limits
male 1.0000 2.886 1.533 5.557
business 1.0000 2.094 1.039 4.206
punish 2 vs 1 1.0000 1.873 0.838 4.124
punish 3 vs 1 1.0000 4.068 1.583 10.616
explain 1.0000 0.349 0.178 0.681

Let’s see how the odds ratios are now interpreted with the ascending order of categories. Odds ratios in ascending ordinal logistic regressions are interpreted as having higher odds of being in a lower category for the target variable. For the male variable, males are 2.9 times more likely to have a lower value on our ethical scale as compared to non-males. Let’s remember how our scale is calculated. A value of 1 keeps the wallet and the money, 2 keeps the money but returns the wallet, and 3 returns both the money and the wallet. For the business variable, business school students are 2.1 times more likely to be lower on the ethical scale.

Descending

Both SAS and R can calculate descending category ordinal logistic regressions. The highest valued category is the starting point and each additional logistic regression adds the next highest valued category - building from the top down.

SAS

In SAS we need one additional option to switch our PROC LOGISTIC model from ascending to descending interpretation. The desc option will switch the order of the logistic regression estimation.

ods html select CLOddsPL;
proc logistic data=Logistic.Wallet desc;
    class punish(param=ref ref='1');
    model wallet = male business punish explain / clodds=pl;
    title 'Ordinal Logistic Regression - Descending';
run;
quit;
Ordinal Logistic Regression - Descending

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect Unit Estimate 95% Confidence Limits
male 1.0000 0.347 0.180 0.652
business 1.0000 0.478 0.238 0.963
punish 2 vs 1 1.0000 0.534 0.242 1.193
punish 3 vs 1 1.0000 0.246 0.094 0.632
explain 1.0000 2.863 1.469 5.612

Let’s see how the odds ratios are now interpreted with the descending order of categories. Odds ratios in descending ordinal logistic regressions are interpreted as having higher odds of being in a higher category for the target variable. For the male variable, males are 0.347 times as likely to have a higher value on our ethical scale as compared to non-males. A better way to think of this is that males have 65.3% lower odds of being in a higher ethical category. Let’s remember how our scale is calculated. A value of 1 keeps the wallet and the money, 2 keeps the money but returns the wallet, and 3 returns both the money and the wallet. For the business variable, business school students have 52.2% lower odds of being in a higher ethical category as compared to non business school students.

R

By default, the polr function in R builds ordinal logistic regression models in descending order of categories. To get the odds ratios from R we just exponentiate the parameter estimates from the logistic regression model.
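As a sketch (assuming the fitted polr object is stored in an object named `fit`, a name introduced here; `confint` on a polr fit uses profile likelihood):

```r
# Odds ratios with profile-likelihood confidence intervals
exp(cbind(OR = coef(fit), confint(fit)))
```

An odds ratio of 0.347 corresponds to (1 - 0.347) * 100 = 65.3% lower odds.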

                OR      lower     upper
male     0.3465172 0.17995548 0.6522567
business 0.4776512 0.23775967 0.9626137
punish2  0.5338490 0.24246863 1.1934058
punish3  0.2458364 0.09418942 0.6315739
explain  2.8630214 1.46945567 5.6121973

Let’s see how the odds ratios are now interpreted with the descending order of categories. Odds ratios in descending ordinal logistic regressions are interpreted as having higher odds of being in a higher category for the target variable. For the male variable, males are 0.347 times as likely to have a higher value on our ethical scale as compared to non-males. A better way to think of this is that males have 65.3% lower odds of being in a higher ethical category. Let’s remember how our scale is calculated. A value of 1 keeps the wallet and the money, 2 keeps the money but returns the wallet, and 3 returns both the money and the wallet. For the business variable, business school students have 52.2% lower odds of being in a higher ethical category as compared to non business school students.

Predictions & Diagnostics

Ordinal logistic regression has a lot of similarities to binary logistic regression:

  • Multicollinearity still exists
  • Non-convergence problems still exist
  • Concordance, Discordance, Tied pairs still exist and so does the \(c\) statistic
  • Generalized \(R^2\) remains the same

There are some inherent differences though between binary and ordinal logistic regression. A lot of the diagnostics cannot be calculated for ordinal logistic regression. ROC curves and residuals cannot typically be calculated because there is actually more than one logistic regression occurring.

Predicted probabilities, however, are produced for each category. Let’s see how we can get predicted probabilities from each of our softwares!

SAS

In PROC LOGISTIC we have the same process of getting predicted probabilities in ordinal logistic regression as we did for binary logistic regression. The OUTPUT statement will score the training set, while the SCORE statement will score the validation data set. The predprobs = I option is used to get individual predicted probabilities for each category.

ods html select Association;
proc logistic data=Logistic.Wallet;
    class punish(param=ref ref='1');
    model wallet = male business punish explain;
    output out=pred predprobs=I;
run;
quit;

proc print data=pred (obs = 5);
run;

proc freq data=pred;
   tables _from_*_into_ / plots=none;
   title 'Crosstabulation of Observed Responses by Predicted Responses';
run;
quit;
Association of Predicted Probabilities and Observed Responses
Percent Concordant 69.0 Somers' D 0.465
Percent Discordant 22.5 Gamma 0.508
Percent Tied 8.5 Tau-a 0.249
Pairs 10154 c 0.732



Obs wallet male business punish explain FROM INTO IP_1 IP_2 IP_3
1 2 0 0 2 0 2 3 0.12563 0.33412 0.54026
2 2 0 0 2 1 2 3 0.04779 0.18135 0.77087
3 3 0 0 1 1 3 3 0.02609 0.11086 0.86305
4 3 0 0 2 0 3 3 0.12563 0.33412 0.54026
5 1 1 0 1 1 1 3 0.07177 0.24233 0.68591



Crosstabulation of Observed Responses by Predicted Responses

Table of FROM (Observed Response) by INTO (Predicted Response)
Cell contents: Frequency (Row Pct)

FROM \ INTO           1            2             3   Total
1             4 (16.67)   11 (45.83)     9 (37.50)      24
2              3 (6.00)    9 (18.00)    38 (76.00)      50
3              0 (0.00)    12 (9.92)   109 (90.08)     121
Total                 7           32           156     195

The IP variables are the individual predicted probabilities for each of the categories of our target variable. Unlike binary logistic regression where we typically did not use a cut-off of 0.5 to determine which category we predict, in ordinal logistic regression we typically pick the category with the highest probability as our predicted category. The FROM and INTO variables are the actual target variable category and the predicted category for the target respectively.

To see how the predicted categories compare with the actual target categories we can use PROC FREQ with a TABLES statement on the FROM and INTO variables.

Accuracy is typically lower in logistic regressions with more than two categories because there are more chances for incorrect predictions. However, errors in ordinal logistic regression can be ranked by severity. For example, if the actual ethics score was 1, then a prediction of 2 is better than a prediction of 3.

R

In R we have the same process of getting predicted probabilities in ordinal logistic regression as we did for binary logistic regression. The predict function will score any data set. The type = probs option is used to get individual predicted probabilities for each category.
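As a sketch (assuming the polr fit is stored in an object named `fit` and the training data frame is named `train`, names introduced here):

```r
# Individual predicted probabilities for each category
pred_probs <- predict(fit, newdata = train, type = "probs")
head(pred_probs)

# Predicted category = the column with the highest probability
pred_class <- colnames(pred_probs)[max.col(pred_probs)]
table(actual = train$wallet, predicted = pred_class)
```

The final table crosstabulates the observed categories against the predicted categories, analogous to the PROC FREQ output in the SAS section.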

           1         2         3
1 0.12562481 0.3341195 0.5402557
2 0.04778463 0.1813420 0.7708734
3 0.02609095 0.1108549 0.8630542
4 0.12562481 0.3341195 0.5402557
5 0.07176375 0.2423258 0.6859105
6 0.02609095 0.1108549 0.8630542

The three variables are the individual predicted probabilities for each of the categories of our target variable. Unlike binary logistic regression where we typically did not use a cut-off of 0.5 to determine which category we predict, in ordinal logistic regression we typically pick the category with the highest probability as our predicted category.

Accuracy is typically lower in logistic regressions with more than two categories because there are more chances for incorrect predictions. However, errors in ordinal logistic regression can be ranked by severity. For example, if the actual ethics score was 1, then a prediction of 2 is better than a prediction of 3.

Nominal Logistic Regression

It is easy to generalize the binary logistic regression model to the ordinal logistic regression model. However, we need to change the underlying model and math slightly to extend to nominal target variables.

In binary logistic regression we calculate the probability that an observation has an event. In ordinal logistic regression we calculate the probability that an observation has at most that event in an ordered list of outcomes. In nominal (or multinomial) logistic regression we calculate the probability that an observation has a specific event in an unordered list of events.

We will be using the alligator data set to model the association between various factors and alligator food choices - fish, invertebrates, birds, reptiles, and other. The variables in the data set are the following:

Alligator Food Data Dictionary
Variable Description
size small (<= 2.3 meters) or large (> 2.3 meters)
lake lake captured (George, Hancock, Oklawaha, Trafford)
gender male or female
count number of observations with characteristics

Generalized Logit Model

Instead of modeling the typical logit from binary or ordinal logistic regression, we will model the generalized logits. These generalized logits are built off target variables with \(m\) categories. The first logit will summarize the first category compared to the reference category. The second logit will summarize the second category compared to the reference. This continues \(m-1\) times as the last logit compares the \((m-1)^{th}\) category to the reference category. In essence, with a target variable with \(m\) categories, we are building \(m-1\) logistic regressions. However, unlike the ordinal logistic regression where we have different intercepts with the same slope parameters, in nominal logistic regression we have both different intercepts and different slope parameters for each model.

The logits are also not traditional logits. Instead of the natural log of the odds, they are natural logs of relative risks - ratios of two probabilities that do not sum to 1.
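As a sketch (with category \(m\) as the reference and notation \(\alpha_j\), \(\beta_{jk}\) introduced here), the \(j^{th}\) generalized logit is:

```latex
\ln\left(\frac{P(Y = j)}{P(Y = m)}\right)
  = \alpha_j + \beta_{j1} x_1 + \cdots + \beta_{jk} x_k,
  \qquad j = 1, \dots, m-1
```

Notice that each of the \(m-1\) models has its own intercept and its own slopes, and that the ratio inside the logarithm is a ratio of two probabilities that do not sum to 1 - a relative risk rather than an odds.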

Let’s see how to build these generalized logit models in each of our softwares!

SAS

The data set is structured slightly differently than previous data sets. Let’s examine the first 10 observations with PROC PRINT and the obs = 10 option.

proc print data=logistic.gator (obs = 10);
run;
Obs size food lake gender count
1 <= 2.3 meters Fish Hancock Male 7
2 <= 2.3 meters Invertebrate Hancock Male 1
3 <= 2.3 meters Other Hancock Male 5
4 > 2.3 meters Fish Hancock Male 4
5 > 2.3 meters Bird Hancock Male 1
6 > 2.3 meters Other Hancock Male 2
7 <= 2.3 meters Fish Hancock Female 16
8 <= 2.3 meters Invertebrate Hancock Female 3
9 <= 2.3 meters Reptile Hancock Female 2
10 <= 2.3 meters Bird Hancock Female 2

Notice the count variable. In the first observation we can see that we have seven alligators that are small males, captured on Lake Hancock, that eat fish. The same applies for each row in our data set. Even with only 56 rows in our data set, we have 219 observations based on the counts.

To account for the multiple counts in a single row of our data set we use the FREQ statement in PROC LOGISTIC with the count variable. The CLASS statement still defines our categorical variables for lake, size, and gender. The target variable food has a reference level specified with the ref = option. Here, our reference level is fish because it has the largest proportion of our observations. The link = glogit option tells PROC LOGISTIC to use the generalized logit model.

proc logistic data=Logistic.Gator plot(only)=oddsratio(range=clip);
    freq count;
    class lake(param=ref ref='George') size(param=ref ref='<= 2.3 meters') gender(param=ref ref='Male');
    model food(ref='Fish') = lake size gender / link=glogit clodds=pl;
    title 'Model on Alligator Food Choice';
run;
quit;
Model on Alligator Food Choice

Model Information
Data Set LOGISTIC.GATOR
Response Variable food
Number of Response Levels 5
Frequency Variable count
Model generalized logit
Optimization Technique Newton-Raphson

Number of Observations Read 56
Number of Observations Used 56
Sum of Frequencies Read 219
Sum of Frequencies Used 219

Response Profile
Ordered Value food Total Frequency
1 Bird 13
2 Fish 94
3 Invertebrate 61
4 Other 32
5 Reptile 19

Logits modeled use food='Fish' as the reference category.


Class Level Information
Class Value Design Variables
lake George 0 0 0
Hancock 1 0 0
Oklawaha 0 1 0
Trafford 0 0 1
size <= 2.3 meters 0
> 2.3 meters 1
gender Female 1
Male 0

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 612.363 585.865
SC 625.919 667.203
-2 Log L 604.363 537.865

Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 66.4974 20 <.0001
Score 59.4616 20 <.0001
Wald 51.2336 20 0.0001

Type 3 Analysis of Effects
Effect DF Wald Chi-Square Pr > ChiSq
lake 12 36.2293 0.0003
size 4 15.8873 0.0032
gender 4 2.1850 0.7018

Analysis of Maximum Likelihood Estimates
Parameter food DF Estimate Standard Error Wald Chi-Square Pr > ChiSq
Intercept Bird 1 -3.0385 0.8319 13.3402 0.0003
Intercept Invertebrate 1 -0.2939 0.3553 0.6845 0.4080
Intercept Other 1 -1.6833 0.5210 10.4397 0.0012
Intercept Reptile 1 -4.0428 1.1836 11.6667 0.0006
lake Hancock Bird 1 0.5753 0.7952 0.5233 0.4694
lake Hancock Invertebrate 1 -1.7805 0.6232 8.1623 0.0043
lake Hancock Other 1 0.7666 0.5686 1.8179 0.1776
lake Hancock Reptile 1 1.1287 1.1925 0.8959 0.3439
lake Oklawaha Bird 1 -0.5504 1.2099 0.2069 0.6492
lake Oklawaha Invertebrate 1 0.9132 0.4761 3.6786 0.0551
lake Oklawaha Other 1 0.0261 0.7778 0.0011 0.9733
lake Oklawaha Reptile 1 2.5295 1.1218 5.0845 0.0241
lake Trafford Bird 1 1.2370 0.8661 2.0398 0.1532
lake Trafford Invertebrate 1 1.1558 0.4928 5.5013 0.0190
lake Trafford Other 1 1.5578 0.6257 6.1987 0.0128
lake Trafford Reptile 1 3.0603 1.1294 7.3423 0.0067
size > 2.3 meters Bird 1 0.7302 0.6523 1.2533 0.2629
size > 2.3 meters Invertebrate 1 -1.3363 0.4112 10.5606 0.0012
size > 2.3 meters Other 1 -0.2906 0.4599 0.3992 0.5275
size > 2.3 meters Reptile 1 0.5570 0.6466 0.7421 0.3890
gender Female Bird 1 0.6064 0.6888 0.7750 0.3787
gender Female Invertebrate 1 0.4630 0.3955 1.3701 0.2418
gender Female Other 1 0.2526 0.4663 0.2933 0.5881
gender Female Reptile 1 0.6275 0.6852 0.8387 0.3598

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect food Unit Estimate 95% Confidence Limits
lake Hancock vs George Bird 1.0000 1.778 0.384 9.612
lake Hancock vs George Invertebrate 1.0000 0.169 0.044 0.528
lake Hancock vs George Other 1.0000 2.152 0.727 6.960
lake Hancock vs George Reptile 1.0000 3.092 0.364 65.177
lake Oklawaha vs George Bird 1.0000 0.577 0.027 5.084
lake Oklawaha vs George Invertebrate 1.0000 2.492 0.993 6.479
lake Oklawaha vs George Other 1.0000 1.026 0.194 4.516
lake Oklawaha vs George Reptile 1.0000 12.547 1.945 248.047
lake Trafford vs George Bird 1.0000 3.445 0.631 20.908
lake Trafford vs George Invertebrate 1.0000 3.177 1.228 8.557
lake Trafford vs George Other 1.0000 4.748 1.431 17.088
lake Trafford vs George Reptile 1.0000 21.334 3.282 426.076
size > 2.3 meters vs <= 2.3 meters Bird 1.0000 2.076 0.588 7.943
size > 2.3 meters vs <= 2.3 meters Invertebrate 1.0000 0.263 0.114 0.576
size > 2.3 meters vs <= 2.3 meters Other 1.0000 0.748 0.298 1.827
size > 2.3 meters vs <= 2.3 meters Reptile 1.0000 1.745 0.505 6.565
gender Female vs Male Bird 1.0000 1.834 0.472 7.345
gender Female vs Male Invertebrate 1.0000 1.589 0.731 3.464
gender Female vs Male Other 1.0000 1.287 0.512 3.222
gender Female vs Male Reptile 1.0000 1.873 0.483 7.358

Plot of Odds Ratios with 95% Profile-Likelihood Confidence Limits


In the output above we see similar things to the output we have seen previously. PROC LOGISTIC summarizes the model used, the number of observations read and used in the model, and builds a frequency table for our target variable. It then summarizes the categorical variables created in the CLASS statement followed by telling us that the model converged. It also provides us with the same global tests, type 3 analysis of effects, parameter estimates, and odds ratios we have discussed in previous sections. The main difference in the parameter estimates is the multiple intercepts and slope coefficients for each variable. Since we have five levels of our target variable we have four logistic regression models and therefore four intercepts and four slope parameters for each variable.

R

The data set is structured slightly differently than previous data sets. Let’s examine the first 10 observations with the head function and the n = 10 option.

            size         food    lake gender count
1  <= 2.3 meters         Fish Hancock   Male     7
2  <= 2.3 meters Invertebrate Hancock   Male     1
3  <= 2.3 meters        Other Hancock   Male     5
4   > 2.3 meters         Fish Hancock   Male     4
5   > 2.3 meters         Bird Hancock   Male     1
6   > 2.3 meters        Other Hancock   Male     2
7  <= 2.3 meters         Fish Hancock Female    16
8  <= 2.3 meters Invertebrate Hancock Female     3
9  <= 2.3 meters      Reptile Hancock Female     2
10 <= 2.3 meters         Bird Hancock Female     2

Notice the count variable. In the first observation we can see that we have seven alligators that are small males, captured on Lake Hancock, that eat fish. The same applies for each row in our data set. Even with only 56 rows in our data set, we have 219 observations based on the counts.

To account for the multiple counts in a single row of our data set we use the weight = option in the multinom function with the count variable. The formula structure for the multinom function is the same as with previous functions in logistic regression. The target variable food needs to be turned into a factor with the factor function. We set the reference level specified with the relevel function. Here, our reference level is fish because it has the largest proportion of our observations. Similar to before, the summary function will display our needed results.
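As a sketch (assuming the `nnet` package is installed and the alligator data is in a data frame named `gator`, matching the call shown in the output below):

```r
# Generalized logit (multinomial) model with multinom
library(nnet)

# Fish as the reference level; count supplies frequency weights
gator$food <- relevel(factor(gator$food), ref = "Fish")
fit_nom <- multinom(food ~ size + lake + gender,
                    data = gator, weights = count)
summary(fit_nom)
```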

# weights:  35 (24 variable)
initial  value 352.466903 
iter  10 value 271.274792
iter  20 value 268.935538
final  value 268.932740 
converged
Call:
multinom(formula = food ~ size + lake + gender, data = gator, 
    weights = count)

Coefficients:
             (Intercept) size> 2.3 meters lakeHancock lakeOklawaha lakeTrafford
Bird          -2.4321397        0.7300740   0.5754699  -0.55020075     1.237216
Invertebrate   0.1690702       -1.3361658  -1.7805555   0.91304120     1.155722
Other         -1.4309095       -0.2905697   0.7667093   0.02603021     1.557820
Reptile       -3.4161432        0.5571846   1.1296426   2.53024945     3.061087
             genderMale
Bird         -0.6064035
Invertebrate -0.4629388
Other        -0.2524299
Reptile      -0.6276217

Std. Errors:
             (Intercept) size> 2.3 meters lakeHancock lakeOklawaha lakeTrafford
Bird           0.7706720        0.6522657   0.7952303    1.2098680    0.8661052
Invertebrate   0.3787475        0.4111827   0.6232075    0.4761068    0.4927795
Other          0.5381162        0.4599317   0.5685673    0.7777958    0.6256868
Reptile        1.0851582        0.6466092   1.1928075    1.1221413    1.1297557
             genderMale
Bird          0.6888385
Invertebrate  0.3955162
Other         0.4663546
Reptile       0.6852750

Residual Deviance: 537.8655 
AIC: 585.8655 

In the output above we see similar things to the output we have seen previously. R summarizes the parameter estimates and their standard errors. The main difference in the parameter estimates is the multiple intercepts and slope coefficients for each variable. Since we have five levels of our target variable we have four logistic regression models and therefore four intercepts and four slope parameters for each variable.

Interpretation

Similar to binary and ordinal logistic regression models, we exponentiate the coefficients in our nominal logistic regression model to make them interpretable. However, the interpretation changes since these are not odds ratios anymore, but relative risk ratios. Let’s use the coefficient for the birds model and the size variable. It would be incorrect to say that the probability of eating birds is 2.076 times more likely for large alligators compared to small alligators. The correct interpretation would be that the predicted relative probability of eating birds rather than fish is 2.076 times more likely in large alligators compared to small alligators. Notice how our interpretation is in comparison to the reference level. Sometimes these are called conditional interpretations.

Let’s see how to get these from each of our softwares!

SAS

By default, SAS produces the odds ratio table that shows the exponentiated coefficients. Even though these are not actually considered odds ratios in the traditional sense, the calculation in the odds ratio table is still correct.

ods html select CLOddsPL;
proc logistic data=Logistic.Gator plot(only)=oddsratio(range=clip);
    freq count;
    class lake(param=ref ref='George') size(param=ref ref='<= 2.3 meters') gender(param=ref ref='Male');
    model food(ref='Fish') = lake size gender / link=glogit clodds=pl;
    title 'Model on Alligator Food Choice';
run;
quit;
Model on Alligator Food Choice

Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect food Unit Estimate 95% Confidence Limits
lake Hancock vs George Bird 1.0000 1.778 0.384 9.612
lake Hancock vs George Invertebrate 1.0000 0.169 0.044 0.528
lake Hancock vs George Other 1.0000 2.152 0.727 6.960
lake Hancock vs George Reptile 1.0000 3.092 0.364 65.177
lake Oklawaha vs George Bird 1.0000 0.577 0.027 5.084
lake Oklawaha vs George Invertebrate 1.0000 2.492 0.993 6.479
lake Oklawaha vs George Other 1.0000 1.026 0.194 4.516
lake Oklawaha vs George Reptile 1.0000 12.547 1.945 248.047
lake Trafford vs George Bird 1.0000 3.445 0.631 20.908
lake Trafford vs George Invertebrate 1.0000 3.177 1.228 8.557
lake Trafford vs George Other 1.0000 4.748 1.431 17.088
lake Trafford vs George Reptile 1.0000 21.334 3.282 426.076
size > 2.3 meters vs <= 2.3 meters Bird 1.0000 2.076 0.588 7.943
size > 2.3 meters vs <= 2.3 meters Invertebrate 1.0000 0.263 0.114 0.576
size > 2.3 meters vs <= 2.3 meters Other 1.0000 0.748 0.298 1.827
size > 2.3 meters vs <= 2.3 meters Reptile 1.0000 1.745 0.505 6.565
gender Female vs Male Bird 1.0000 1.834 0.472 7.345
gender Female vs Male Invertebrate 1.0000 1.589 0.731 3.464
gender Female vs Male Other 1.0000 1.287 0.512 3.222
gender Female vs Male Reptile 1.0000 1.873 0.483 7.358

R

As with previous models, we just need to use the exp function to exponentiate the coefficients from our nominal logistic regression model to get the relative risk ratios.
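As a sketch (assuming the multinom fit is stored in an object named `fit_nom`, a name introduced here):

```r
# Relative risk ratios - one row of exponentiated
# coefficients per non-reference category
exp(coef(fit_nom))
```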

             (Intercept) size> 2.3 meters lakeHancock lakeOklawaha lakeTrafford
Bird          0.08784866        2.0752341   1.7779659     0.576834     3.446005
Invertebrate  1.18420329        0.2628516   0.1685445     2.491889     3.176316
Other         0.23909136        0.7478374   2.1526708     1.026372     4.748458
Reptile       0.03283884        1.7457506   3.0945502    12.556638    21.350755
             genderMale
Bird          0.5453086
Invertebrate  0.6294311
Other         0.7769106
Reptile       0.5338600

Predictions and Diagnostics

Nominal logistic regression has a lot of similarities to binary logistic regression:

  • Multicollinearity still exists
  • Non-convergence problems still exist
  • AIC and BIC metrics can still be calculated
  • Generalized \(R^2\) remains the same

There are some inherent differences though between binary and nominal logistic regression. A lot of the diagnostics cannot be calculated for nominal logistic regression. ROC curves and residuals cannot typically be calculated because there is actually more than one logistic regression occurring.

Predicted probabilities, however, are produced for each category. Let’s see how we can get predicted probabilities from each of our softwares!

SAS

In PROC LOGISTIC we have the same process of getting predicted probabilities in nominal logistic regression as we did for binary and ordinal logistic regression. The OUTPUT statement will score the training set, while the SCORE statement will score the validation data set. The predprobs = I option is used to get individual predicted probabilities for each category.

ods html select none;
proc logistic data=Logistic.Gator plot(only)=oddsratio(range=clip);
    freq count;
    class lake(param=ref ref='George') size(param=ref ref='<= 2.3 meters') gender(param=ref ref='Male');
    model food(ref='Fish') = lake size gender / link=glogit clodds=pl;
    output out=pred predprobs=I;
    title 'Model on Alligator Food Choice';
run;
quit;
ods html select all;

proc print data=pred (obs = 5);
run;

proc freq data=pred;
    weight count;
    tables _FROM_*_INTO_;
    title 'Crosstabulation of Observed by Predicted Responses';
run;
Model on Alligator Food Choice

Obs  size           food          lake     gender  count  _FROM_        _INTO_  IP_Bird  IP_Fish  IP_Invertebrate  IP_Other  IP_Reptile
  1  <= 2.3 meters  Fish          Hancock  Male        7  Fish          Fish    0.05115  0.60065  0.075459         0.24016   0.032587
  2  <= 2.3 meters  Invertebrate  Hancock  Male        1  Invertebrate  Fish    0.05115  0.60065  0.075459         0.24016   0.032587
  3  <= 2.3 meters  Other         Hancock  Male        5  Other         Fish    0.05115  0.60065  0.075459         0.24016   0.032587
  4  > 2.3 meters   Fish          Hancock  Male        4  Fish          Fish    0.11023  0.62365  0.020592         0.18647   0.059057
  5  > 2.3 meters   Bird          Hancock  Male        1  Bird          Fish    0.11023  0.62365  0.020592         0.18647   0.059057



Crosstabulation of Observed by Predicted Responses

Frequency
Percent
Row Pct
Col Pct

Table of _FROM_ by _INTO_
_FROM_ (Formatted Value of the Observed Response) by _INTO_ (Formatted Value of the Predicted Response)

_FROM_              Fish       Invertebrate    Reptile     Total
Bird                  12            1             0           13
                       5.48         0.46          0.00         5.94
                      92.31         7.69          0.00
                       7.50         1.72          0.00
Fish                  81           13             0           94
                      36.99         5.94          0.00        42.92
                      86.17        13.83          0.00
                      50.63        22.41          0.00
Invertebrate          29           31             1           61
                      13.24        14.16          0.46        27.85
                      47.54        50.82          1.64
                      18.13        53.45        100.00
Other                 23            9             0           32
                      10.50         4.11          0.00        14.61
                      71.88        28.13          0.00
                      14.38        15.52          0.00
Reptile               15            4             0           19
                       6.85         1.83          0.00         8.68
                      78.95        21.05          0.00
                       9.38         6.90          0.00
Total                160           58             1          219
                      73.06        26.48          0.46       100.00

The IP variables are the individual predicted probabilities for each category of the target variable. Unlike binary logistic regression, where we typically did not use a cut-off of 0.5 to determine the predicted category, in nominal logistic regression we typically pick the category with the highest predicted probability as our predicted category. The _FROM_ and _INTO_ variables are the actual target variable category and the predicted category for the target, respectively.

To see how the predicted categories compare with the actual target categories, we can use PROC FREQ with a TABLES statement on the _FROM_ and _INTO_ variables.

Accuracy is typically not as high in logistic regressions with more than two categories because there are more chances for incorrect predictions.

R

In R we have the same process of getting predicted probabilities in nominal logistic regression as we did for binary and ordinal logistic regression. The predict function will score any data set. The type = "probs" option is used to get individual predicted probabilities for each category.
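A minimal sketch of the scoring step, assuming the model was fit with multinom from the nnet package and stored as gator_model on a data frame gator (both names illustrative):

```r
library(nnet)

# Fit the nominal logistic regression; weights = count plays the
# same role as the FREQ statement in PROC LOGISTIC
gator_model <- multinom(food ~ lake + size + gender,
                        data = gator, weights = count)

# type = "probs" returns one column of predicted probabilities
# per category of the target variable
pred_probs <- predict(gator_model, newdata = gator, type = "probs")
head(pred_probs)

# type = "class" instead returns the category with the highest
# predicted probability for each observation
pred_class <- predict(gator_model, newdata = gator, type = "class")
table(Observed = gator$food, Predicted = pred_class)
```

The table call at the end mirrors the PROC FREQ crosstabulation of observed by predicted responses from the SAS section.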

       Fish       Bird Invertebrate     Other    Reptile
1 0.6006304 0.05115737   0.07545645 0.2401706 0.03258518
2 0.6006304 0.05115737   0.07545645 0.2401706 0.03258518
3 0.6006304 0.05115737   0.07545645 0.2401706 0.03258518
4 0.6236286 0.11022853   0.02059329 0.1864858 0.05906375
5 0.6236286 0.11022853   0.02059329 0.1864858 0.05906375
6 0.6236286 0.11022853   0.02059329 0.1864858 0.05906375

The five variables are the individual predicted probabilities for each category of the target variable. Unlike binary logistic regression, where we typically did not use a cut-off of 0.5 to determine the predicted category, in nominal logistic regression we typically pick the category with the highest predicted probability as our predicted category.

Accuracy is typically not as high in logistic regressions with more than two categories because there are more chances for incorrect predictions.
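As a rough illustration, overall accuracy can be computed from the counts in the observed-by-predicted crosstabulation in the SAS section: the correctly classified observations sit on the diagonal of that table.

```r
# Diagonal (correctly classified) counts from the crosstabulation:
# Fish 81, Invertebrate 31, Reptile 0; Bird and Other are never predicted
correct <- 81 + 31 + 0
total   <- 219

correct / total   # about 0.51 -- just over half of the observations
```

Even though every individual predicted probability can be well calibrated, spreading the predictions across five categories leaves much more room for misclassification than a two-category problem would.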